
I've recently written two functions that convert a list of strings like this one (in my case, each string is 101 characters long):

['AGT', 'AAT']

To a numpy array:

array([[[[1],
         [0],
         [0],
         [0]],

        [[0],
         [1],
         [0],
         [0]],

        [[0],
         [0],
         [0],
         [1]]],

       [[[1],
         [0],
         [0],
         [0]],

        [[1],
         [0],
         [0],
         [0]],

        [[0],
         [0],
         [0],
         [1]]]])

The shape of this array is (2, 3, 4, 1).

At the moment, my code defines one function that holds a mapping dictionary and applies it to each character of a single input string, like so:

import numpy as np

def sequence_one_hot_encoder(seq):
    # Each letter maps to a column vector of shape (4, 1).
    mapping = {
        "A": [[1], [0], [0], [0]],
        "G": [[0], [1], [0], [0]],
        "C": [[0], [0], [1], [0]],
        "T": [[0], [0], [0], [1]],
        "X": [[0], [0], [0], [0]],
        "N": [[1], [1], [1], [1]],
    }

    # Result has shape (len(seq), 4, 1).
    encoded_seq = np.array([mapping[i] for i in str(seq)])
    return encoded_seq

I then use a second function to map this function over my list of strings:

import numpy as np

def sequence_list_encoder(sequence_file):
    # Encode every string, then stack the per-string arrays into one array.
    one_hot_encoded_array = np.asarray(list(map(sequence_one_hot_encoder, sequence_file)))
    print(one_hot_encoded_array.shape)
    return one_hot_encoded_array

At the moment, for a list containing 1,688,119 strings of 101 characters each, this takes around 7-8 minutes. Is there a better way to write these two functions to reduce the runtime?

1 Answer

sequence_one_hot_encoder(seq) builds an array of shape (len(seq), 4, 1). sequence_list_encoder() collects all of these in a Python list and then converts the list into an array of shape (number_of_sequences, len(seq), 4, 1). That appears to carry a lot of overhead. It is much faster to build one_hot_encoded_array as a flat 1-D array and set its shape at the end.

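import numpy as np

def sequence_list_encoder(sequence_file):
    mapping = {
        "A": (1, 0, 0, 0),
        "G": (0, 1, 0, 0),
        "C": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
        "X": (0, 0, 0, 0),
        "N": (1, 1, 1, 1),
    }

    # Note: sequence_file is an open text file here, one sequence per line.
    sequences = sequence_file.read().splitlines()

    # Flatten everything into one long run of 0s and 1s.
    bits = [b for seq in sequences for ch in seq for b in mapping[ch]]

    # Build a 1-D array, then reshape in place; assumes equal-length sequences.
    one_hot_encoded_array = np.fromiter(bits, dtype=np.uint8)
    one_hot_encoded_array.shape = (len(sequences), len(sequences[0]), 4, 1)

    return one_hot_encoded_array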

This runs in about one fifth of the time your code takes.
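The same "flat first, reshape last" idea can be pushed further by removing the per-character Python loop entirely. The following is only a sketch of a different technique (a byte-indexed lookup table with NumPy integer indexing), not the code timed above, and it has not been benchmarked against the data in the question. The names MAPPING, LOOKUP, and encode_with_lookup are hypothetical, and the sketch assumes a list of equal-length ASCII strings that contain only the six mapped letters.

```python
import numpy as np

# Same letter-to-vector mapping as in the question (names here are hypothetical).
MAPPING = {
    "A": (1, 0, 0, 0),
    "G": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
    "X": (0, 0, 0, 0),
    "N": (1, 1, 1, 1),
}

# Lookup table with one row per possible byte value (0-255).
LOOKUP = np.zeros((256, 4), dtype=np.uint8)
for letter, bits in MAPPING.items():
    LOOKUP[ord(letter)] = bits


def encode_with_lookup(sequences):
    """One-hot encode a list of equal-length ASCII strings."""
    seq_len = len(sequences[0])
    # Assumption: only mapped letters occur. Unlike the dictionary version,
    # an unmapped ASCII character encodes as all zeros instead of raising.
    codes = np.frombuffer("".join(sequences).encode("ascii"), dtype=np.uint8)
    # Integer-array indexing picks one table row per character.
    return LOOKUP[codes].reshape(len(sequences), seq_len, 4, 1)


encoded = encode_with_lookup(["AGT", "AAT"])
print(encoded.shape)  # (2, 3, 4, 1)
```

One trade-off to be aware of: because the table has a row for every byte value, a stray character is not caught the way mapping[ch] would catch it with a KeyError, so input validation would need to happen separately if it matters.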
