2

I want to use a regular expression that matches the pattern 'ATATAT' (of any length) and/or 'GCCGCCGCC' (again of any length) in a text file. I have only four options, and one of them should work, but I have tried all of them several times on a text file containing those patterns. Each of the patterns below either returns nothing or ends in the error "grep: Invalid back reference". Maybe I shouldn't be using grep at all?

  • [ATGC]{2,}
  • ([ATGC]{2,})\1+
  • ([ATGC]{2,}){2,}
  • ([ATGC])\1+

Essentially, the command I am using is the following:

grep 'one_of_the_patterns_above' DNA_sequence_file.fasta

And the file looks something like this:

>sampled sequence 1 consisting of 500 bases.
GCAAAGTAGCCGAGGTCAGGGCATGTCAATGATAGCGCGAAAAGGTCACCACGAGAAGCG
GCACTCGGCCACGGATTGGTGGCACTTCATATGGAAACGCGACGACCGATAAAAACACAA
CGAAACCCAATTGGAATGAGATTTTCCTGAAACCGCAGCGAACCCAACCAAGCGGGAATA
AAGTCGGGAAGTCTAAACGAGATTAGCAGAATCCACCTCAGAATGACTGATGCCATGTAG
GCGCAGCAATAGATTACCGAAAGAGAAACACAGCAACGGATACATACAACTCAAGGGAAG
AGCACCTTTCGCTGAGAGGAGACGCCTTACAAACTATCCAGGGGTTTGAACAAGACAGGT
CGAAAAGCGGCCCTCTTCACAACCAGGTCAAGCGCGACTCGAGACAAGTATTCCCAAAGT
CCAAAAAAGAATCCTACAGAATCCCATCAAAGCATTTGTAGAAAGACATGGCCTACCAGC
TGCGCAAAGGACACATTACC
2
  • Are you playing with DNA sequences? :-) Wow! :-) Commented Feb 18, 2017 at 19:42
  • @peterh Yes, exactly! Sorry, I actually forgot to post my code and a small file sample. I will do that now. Commented Feb 18, 2017 at 19:51

2 Answers

2

It looks like you want to match "AT" repeated at least twice, or, in your other example, "GCC" repeated at least twice. Those would be, respectively:

(AT){2,}
(GCC){2,}

Note that you will have to use grep -E for these patterns to match: without -E, grep uses basic regular expressions, in which unescaped ( ) and { } are ordinary literal characters. (There isn't a single, consistent syntax for regular expressions that works identically across tools, so you may have to adapt depending on which you end up using.)
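
As a rough sketch of how this could look on the command line: the sample lines below are made up for illustration (they are not the asker's file), and -o is a widely available grep extension that prints only the matched text, so treat its availability as an assumption about your grep.

```shell
# Made-up sample input (hypothetical, not taken from the question's file).
sample=$(printf '%s\n' '>demo sequence' 'TTATATATGG' 'CAGCCGCCGCCTA' 'ACGTACGA')

# Drop FASTA header lines, then print each run of 2+ "AT" or 2+ "GCC".
# -E selects extended regular expressions; -o prints only the matched text.
matches=$(printf '%s\n' "$sample" | grep -v '^>' | grep -E -o '(AT){2,}|(GCC){2,}')
printf '%s\n' "$matches"
```

On this input it should print ATATAT and GCCGCCGCC on separate lines. Keep in mind that grep works line by line, so a repeat that is split across two lines of a wrapped FASTA record would not be found this way.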

1
  • Wow, this seems to be exactly what I forgot to have in my code; the extended expression.. Thanks a lot!!! Commented Feb 18, 2017 at 19:53
-1

All of the patterns are bad: they match any of A, T, G and C in any order. The correct regexp is:

^((AT)*|(GCC)*)$

This matches a line that consists entirely of repeated AT or entirely of repeated GCC, which is what you wrote.
