I've recently developed two functions that essentially convert a list of strings like this (each string is 101 characters long in my case):
['AGT', 'AAT']
To a numpy array:
array([[[[1],
         [0],
         [0],
         [0]],
        [[0],
         [1],
         [0],
         [0]],
        [[0],
         [0],
         [0],
         [1]]],
       [[[1],
         [0],
         [0],
         [0]],
        [[1],
         [0],
         [0],
         [0]],
        [[0],
         [0],
         [0],
         [1]]]])
The resulting array has shape (2, 3, 4, 1) in this case.
At the moment, my code defines one function containing a dictionary, which is then mapped over a single input string, like so:
import numpy as np

def sequence_one_hot_encoder(seq):
    # Look up each base and stack the 4 x 1 one-hot columns
    mapping = {
        "A": [[1], [0], [0], [0]],
        "G": [[0], [1], [0], [0]],
        "C": [[0], [0], [1], [0]],
        "T": [[0], [0], [0], [1]],
        "X": [[0], [0], [0], [0]],
        "N": [[1], [1], [1], [1]],
    }
    encoded_seq = np.array([mapping[base] for base in str(seq)])
    return encoded_seq
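For reference, a quick sanity check on one of the example strings (using the function above):

seq_array = sequence_one_hot_encoder("AGT")
print(seq_array.shape)  # (3, 4, 1): one 4 x 1 one-hot column per base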
Following on from this, I create a second function that maps the first one over my list of strings:
def sequence_list_encoder(sequence_file):
    # Encode every string in the list and stack the results into one array
    one_hot_encoded_array = np.asarray(list(map(sequence_one_hot_encoder, sequence_file)))
    print(one_hot_encoded_array.shape)
    return one_hot_encoded_array
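Used together on the example list from the start of the post, this reproduces the array shown above:

encoded = sequence_list_encoder(['AGT', 'AAT'])  # prints (2, 3, 4, 1)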
At the moment, for a list containing 1,688,119 strings of 101 characters each, this takes around 7-8 minutes. Is there a better way of rewriting these two functions to reduce the runtime?
