3

I have a requirement where I have the alphabet 'ACGT' and I need to create a string of around 20,000 characters. This string should contain 100+ occurrences of the pattern "CCGT". Most of the time the generated string contains it around 20-30 instances.

    int N = 20000;
    std::string alphabet("ACGT");
    std::string str;
    str.reserve(N);
    for (int index = 0; index < N; index++)
    {
        str += alphabet[rand() % (alphabet.length())];
    }

How do I tweak the code so that the pattern would appear more often?

Edit - Is there a way of changing the alphabet, i.e - 'A', 'C', 'G', 'T', 'CCGT' as characters of the alphabet?

Thank you.

14
  • 1
    Every N / 4 / 100 iteration, insert the special sequence, while skipping four characters in index? Commented Apr 23, 2015 at 10:37
  • The problem seems to be rand, I'd recommend looking at these alternatives. Commented Apr 23, 2015 at 10:37
  • @Jefffrey I think the OP wants to weigh the randomness to generate a special sequence more often than the other, which is totally different than the duplicate, so I vote to reopen. Commented Apr 23, 2015 at 10:40
  • 1
    How can you have more than one probability of a substring? I think you meant "occurrance". And how can it be random if there must be at least 100 </pedant> Commented Apr 23, 2015 at 10:43
  • @Skizz Yeah sorry, that's what I meant Commented Apr 23, 2015 at 10:44

6 Answers 6

2

Generate an array of ints containing 100 x 0s and 490 1s, 2s, 3s and 4s [000000....111111....2222 etc] making almost 20,000 entries.

Then random shuffle it (std::random_shuffle)

Then write a string where each 0 translates to 'CCGT', each 1 translates to 'A', each 2 .... etc

I think that gives you what you want, and by tweaking the original array of ints you could change the number of 'A' characters in the output too.

Edit: If that isn't random enough, do 100 0s at the start and then random 1-4 for the rest.

Sign up to request clarification or add additional context in comments.

3 Comments

I believe it should be 400 of 1s - 4s, currently it adds up to 2360 characters.
well, 100 0s, and (20,000 - (100 * 4)) of the rest in total - I assume magic numbers won't appear in the real code either
This is almost the OPs original suggestion - it just simplifies the problem by treating a 'CCGT' as a single character until it gets to the end. You could as easily generate the string including 100 'X' characters and then substitute afterwards, although I think shuffling is the easiest way to get 100.
1

The only solution I can think of that would meet the "100+" criteria is:

create 20000 character string
number of instances (call it n) = 100 + some random value
for (i = 0 ; i < n ; ++i)
{
   pick random start position
   write CCGT
}

Of course, you'd need to ensure the overwritten characters weren't part of a "CCGT" already.

Comments

1

My first thought would be to generate a list of 100 indexes where you will definitely insert the special string. Then as you generate the random string, insert the special string at each of these indexes as you reach them.

I've missed out checking that the intervals are spaced appropriately (cannot be within 4 of another interval) and also sorting them in ascending order - both of which would be necessary for this to work.

int N = 20000;
std::string alphabet("ACGT");
int intervals[100];
for (int index = 0; index < 100; index)
{
    intervals[index] = rand() % 2000;
    // Do some sort of check to make sure each element of intervals is not
    // within 4 of another element and that no elements are repeated
}
// Sort the intervals array in ascending order
int current_interval_index = 0;
std::string str;
str.reserve(N);
for (int index = 0; index < N; index++)
{
    if (index == intervals[current_interval_index])
    {
        str += alphabet;
        current_interval_index++;
        index += 3;
    }
    else
    {
        str += alphabet[rand() % (alphabet.length())];
    }
}

Comments

1

The solution I came up with uses a std::vector to contain all the random sets of 4 chars including the 100 special sequence. I then shuffle that vector to distribute the 100 special sequence randomly throughout the string.

To make the distribution of letters even I create an alternative alphabet string called weighted that contains a relative abundance of alphabet characters according to what has already been included from the 100 special sequence.

int main()
{
    std::srand(std::time(0));

    using std::size_t;

    const size_t N = 20000;

    std::string alphabet("ACGT");

    // stuff the ballot
    std::vector<std::string> v(100, "CCGT");

    // build a properly weighted alphabet string
    // to give each letter equal chance of appearing
    // in the final string

    std::string weighted;

    // This could be scaled down to make the weighted string much smaller

    for(size_t i = 0; i < (N - 200) / 4; ++i) // already have 200 Cs
        weighted += "C";

    for(size_t i = 0; i < (N - 100) / 4; ++i) // already have 100 Ns & Gs
        weighted += "GT";

    for(size_t i = 0; i < N / 4; ++i)
        weighted += "A";

    size_t remaining = N - (v.size() * 4);

    // add the remaining characters to the weighted string
    std::string s;
    for(size_t i = 0; i < remaining; ++i)
        s += weighted[std::rand() % weighted.size()];

    // add the random "4 char" sequences to the vector so
    // we can randomly distribute the pre-loaded special "4 char"
    // sequence among them.
    for(std::size_t i = 0; i < s.size(); i += 4)
        v.push_back(s.substr(i, 4));

    // distribute the "4 char" sequences randomly
    std::random_shuffle(v.begin(), v.end());

    // rebuild string s from vector
    s.clear();
    for(auto&& seq: v)
        s += seq;

    std::cout << s.size() << '\n'; // should be N
}

Comments

1

I like the answer by @Andy Newman and think that is probably the best way - the code below is a compilable example of what they suggested.

#include <string>
#include <algorithm>
#include <iostream>

int main()
{
    srand(time(0));
    int N = 20000;
    std::string alphabet("ACGT");
    std::string str;
    str.reserve(N);
    int int_array[19700];
    // Populate int array
    for (int index = 0; index < 19700; index++)
        {
        if (index < 100)
        {
            int_array[index] = 0;
        }
        else
        {
            int_array[index] = (rand() % 4) + 1;
        }
    }
    // Want the array in a random order
    std::random_shuffle(&int_array[0], &int_array[19700]);
    // Now populate string from the int array
    for (int index = 0; index < 19700; index++)
    {
        switch(int_array[index])
        {
            case 0:
                str += alphabet;
                break;
            case 1:
                str += 'A';
                break;
            case 2:
                str += 'C';
                break;
            case 3:
                str += 'G';
                break;
            case 4:
                str += 'T';
                break;
            default:
                break;
        }
    }
    // Print out to check what it looks like
    std::cout << str << std::endl;
}

Comments

1

You should make N larger.

I take this liberty because you say 'create a string of around 20,000 characters'; but there's more to it than that.

If you're only finding around 20-30 instances in a string of 20000 characters then something is wrong. A ballpark estimate is to say that there are 20000 character positions to test, and at each of these there will be a four-letter string from an alphabet of four letters, giving a 1/256 chance of it being a specific string. The average should be (approximately; because I've oversimplified) 20000/256, or 78 hits.

It could be that your string isn't properly randomised (likely due to the use of the modulo idiom), or perhaps you're testing only every fourth character position -- as if the string were a list of non-overlapping four-letter words.

If you can bring your average hit rate back up to 78, then you can reach a little further to your 100 requirement simply by increasing N proportionally.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.