1

I am dealing with short strings (they're DNA sequences) of ~30 length. For my purposes, every 5th position needs to be swapped for any of the 4 DNA bases (A, C, T, G). e.g. if I have an input of AAAAAAAAAAAAAA the output would be a list of:

AAAAAAAAAAAAAA
AAAACAAAAAAAAA
AAAATAAAAAAAAA
AAAAGAAAAAAAAA
AAAACAAAACAAAA
AAAACAAAATAAAA
....

That is, every 5th position is individually swapped for A,C,T or G, to generate an array of all possible sequences where each 5th position is all possible DNA bases.

I have been attempting this with for loops, and can edit each 5th position, but not in a combinatorial approach

e.g.

echo "AAAAAAAAAAAAAAA" > one.spacer 
for i in $(seq 1 3)
  do
    for base in {a,c,t,g}
      do
       awk -v b=$base -v x=$i '{print substr ($0,1,5*x-1) b substr ($0,5*x+1,100)}' one.spacer
    done
done

gives the output:

AAAAaAAAAAAAAAA
AAAAcAAAAAAAAAA
AAAAtAAAAAAAAAA
AAAAgAAAAAAAAAA
AAAAAAAAAaAAAAA
AAAAAAAAAcAAAAA
AAAAAAAAAtAAAAA
AAAAAAAAAgAAAAA
AAAAAAAAAAAAAAa
AAAAAAAAAAAAAAc
AAAAAAAAAAAAAAt
AAAAAAAAAAAAAAg

but hopefully you can see that this is edited only singly at each 5th position. I need list of sequences that will include, for example

AAAAgAAAAgAAAAg
AAAAcAAAAtAAAAa

as well as all the other combinations. Hopefully that's a little clearer

5
  • It wouldn't be a random sample, it would be every one of the options A,C,T,G at every 5th position. So, for example, if the original sequence were 10 characters long, you would end up with 16 output sequences (4 different options at position 5, combined with 4 options at position 10) Commented Aug 15, 2021 at 18:48
  • 1
    @catchprj: Your description above and input provided doesn't match. Update a concise example of your input and an exact output needed, with whatever attempts that you had attempted that didn't work Commented Aug 15, 2021 at 18:59
  • I have edited the original question. Hopefully it's a little clearer, though I fear it may be more confusing now! Commented Aug 15, 2021 at 19:43
  • You probably want to use perl instead of awk for this, specifically bioperl - it wouldn't surprise me if there was an existing library function to generate all possible permutations of a DNA sequence according to user-definable criteria/patterns (although I can't recall or find one right now). Also, while questions about processing DNA sequences are welcome here, U&L has a sister site specifically for bioinformatics questions. Commented Aug 16, 2021 at 4:17
  • The input word has 14 characters, but the words in your expected output have 15, which clashes with your specification and the clarifications in your comment above. Commented Aug 16, 2021 at 14:52

3 Answers 3

2

This will run in a fraction of a second even for your real 30-char width input using any awk in any shell on every Unix box:

$ cat tst.awk
function mutate(old,lgth,       new,i,j) {
    for (i=5; i<=lgth; i+=5) {
        for (j=1; j<=4; j++) {
            new = substr(old,1,i-1) substr("ACTG",j,1) substr(old,i+1)
            if ( !seen[new]++ ) {
                print new
                mutate(new,lgth)
            }
        }
    }
}

{ mutate($0,length($0)) }

$ echo 'AAAAAAAAAAAAAAA' | awk -f tst.awk
AAAAAAAAAAAAAAA
AAAACAAAAAAAAAA
AAAATAAAAAAAAAA
AAAAGAAAAAAAAAA
AAAAGAAAACAAAAA
AAAAAAAAACAAAAA
AAAACAAAACAAAAA
AAAATAAAACAAAAA
AAAATAAAATAAAAA
AAAAAAAAATAAAAA
AAAACAAAATAAAAA
AAAAGAAAATAAAAA
AAAAGAAAAGAAAAA
AAAAAAAAAGAAAAA
AAAACAAAAGAAAAA
AAAATAAAAGAAAAA
AAAATAAAAGAAAAC
AAAAAAAAAGAAAAC
AAAACAAAAGAAAAC
AAAAGAAAAGAAAAC
AAAAGAAAAAAAAAC
AAAAAAAAAAAAAAC
AAAACAAAAAAAAAC
AAAATAAAAAAAAAC
AAAATAAAACAAAAC
AAAAAAAAACAAAAC
AAAACAAAACAAAAC
AAAAGAAAACAAAAC
AAAAGAAAATAAAAC
AAAAAAAAATAAAAC
AAAACAAAATAAAAC
AAAATAAAATAAAAC
AAAATAAAATAAAAT
AAAAAAAAATAAAAT
AAAACAAAATAAAAT
AAAAGAAAATAAAAT
AAAAGAAAAAAAAAT
AAAAAAAAAAAAAAT
AAAACAAAAAAAAAT
AAAATAAAAAAAAAT
AAAATAAAACAAAAT
AAAAAAAAACAAAAT
AAAACAAAACAAAAT
AAAAGAAAACAAAAT
AAAAGAAAAGAAAAT
AAAAAAAAAGAAAAT
AAAACAAAAGAAAAT
AAAATAAAAGAAAAT
AAAATAAAAGAAAAG
AAAAAAAAAGAAAAG
AAAACAAAAGAAAAG
AAAAGAAAAGAAAAG
AAAAGAAAAAAAAAG
AAAAAAAAAAAAAAG
AAAACAAAAAAAAAG
AAAATAAAAAAAAAG
AAAATAAAACAAAAG
AAAAAAAAACAAAAG
AAAACAAAACAAAAG
AAAAGAAAACAAAAG
AAAAGAAAATAAAAG
AAAAAAAAATAAAAG
AAAACAAAATAAAAG
AAAATAAAATAAAAG
1
  • 1
    This is perfect, and certainly not something I could have come up with myself. Thank you! Commented Aug 16, 2021 at 19:36
1

That goes quite a bit against what one would consider good shell coding practice, is likely to be inefficient, and wouldn't scale well to large inputs, but for brevity, with the ksh93 shell and assuming the default value of $IFS, you could do:

words=($(<your-file))
printf '%s\n' ${words[@]//{4}(?)?/\1{A,C,T,G}}

With the ${var//pattern/replacement}, we're replacing every sequence of 4 character + 1, with the 4 characters and {A,C,T,G} which in ksh ends up expanded as per csh brace expansion upon unquoted parameter expansion like that.

0

The itertools module in Python has a lot of methods to deal with such combinatorial problems.

python3 - <<\eof
import itertools as it

dna = 'atcg'
step = 5

with open('yourfile') as f:
  for _ in f:
    l = _.rstrip('\n')
    w = len(l)
    I = [i for i in range(step-1,w,step)]
    for t1 in  it.product(dna,repeat=int(w/step)):
      t = list(t1)[::-1]
      print(*[
        t.pop(0) if idx in I else e
        for idx,e in enumerate(l)],sep="")
eof
  • From the iterations module, the method product generates a Cartesian product of the iterations inputted, in our case tne dna sequence multiple times.
  • We turn it into an infinite iterator that never ends and recycled from start once the number of Cartesian products is reached whilst the input file still has data in it.

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.