
I'm trying to obtain the organism name from the headers of a FASTA file. Specifically, I want to extract the value of the OS= field (the organism name) from each record's description.

FASTA HEADER
>sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1
MPICEFSATSKSRKIDVHAHVLPKNIPDFQEKFGYPGFVRLDHKEDGTTHMVKDGKLFRV
VEPNCFDTETRIADMNRANVNVQCLSTVPVMFSYWAKPADTEIVARFVNDDLLAECQKFP
GKEHIVLGTDYPFPLGEL
EVGRVVEEYKPFSAKDREDLLWKNAVKMLDIDENLLFNKDF
>sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2
MNSLLRLSHLAGPAHYRALHSSSSIWSKVAISKFEPKSYLPYEKLSQTVKIVKDRLKRPL
TLSEKILYGHLDQPKTQDIERGVSYLRLRPDRVAMQDATAQMAMLQFISSGLPKTAVPST
IHCDHLIEAQKGGAQDLARAKDLNKEVFNFLATAGSKYGVGFWKPGSGIIHQIILENYAF

Code for obtaining the FASTA header

from Bio import SeqIO
import re
import pandas as pd


input_file = "ANIMAL.fasta"

# SeqIO.parse accepts a filename directly, so no file handle is left open
fasta_sequences = SeqIO.parse(input_file, 'fasta')
for fasta in fasta_sequences:
    fasta_id, sequence = fasta.id, str(fasta.seq)
    # description holds the full header line
    print(fasta.description)

Current Output:

sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1

sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2

Desired Output:

Caenorhabditis elegans
Caenorhabditis elegans
  • did you try regex? Commented Sep 16, 2020 at 14:53
  • cross posted : biostars.org/p/461697 Commented Sep 16, 2020 at 16:24

1 Answer


You can search for your information using a regex:

import re
example = "sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2"

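# Search for "OS=" rather than bare "OS", which could also match earlier in the header
start = re.search("OS=", example).start()
# Skip the three characters of "OS=" and cut at the next field, " GN="
result = example[start+3:].split(" GN=")[0].strip()
print(result)
# prints: Caenorhabditis elegans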

This code takes the text after "OS=" up to "GN" and strips the surrounding whitespace. It assumes that a GN field follows the organism name in the header.
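If some headers lack a GN field, or have another field directly after the organism name, a single regex that stops at the next two-letter "XX=" key may be more robust. The following is only a sketch: the helper name extract_organism is made up for illustration, and the pattern assumes UniProt-style headers in which every field after OS= starts with two uppercase letters and an equals sign.

```python
import re

# Assumption: fields look like "OS=...", "GN=...", "PE=...", i.e. two
# uppercase letters followed by "=". The lazy group stops at the next
# such key or at the end of the string.
OS_PATTERN = re.compile(r"\bOS=(.+?)(?=\s+[A-Z]{2}=|$)")


def extract_organism(description):
    """Return the organism name from a header description, or None if absent."""
    match = OS_PATTERN.search(description)
    return match.group(1).strip() if match else None


headers = [
    "sp|Q8T8B9|ACMSD_CAEEL 2-amino-3-carboxymuconate-6-semialdehyde decarboxylase OS=Caenorhabditis elegans GN=acsd-1 PE=2 SV=1",
    "sp|P34455|ACON_CAEEL Probable aconitate hydratase, mitochondrial OS=Caenorhabditis elegans GN=aco-2 PE=3 SV=2",
]

for header in headers:
    print(extract_organism(header))
```

With Biopython, the same helper could be called as extract_organism(fasta.description) inside the loop from the question. Returning None for headers without an OS= field avoids the AttributeError that re.search(...).start() raises when nothing matches.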


1 Comment

Very nice approach, thanks @yannick! :) So .start() just returns the index of the position where the "OS" match starts, correct?
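On that point: for a match object returned by re.search, .start() gives the index of the first matched character and .end() gives the index just past the last one. A small sketch, using a shortened, made-up header purely for illustration:

```python
import re

# Shortened, illustrative header (not a real UniProt entry)
header = "Example protein OS=Caenorhabditis elegans GN=aco-2"

match = re.search("OS=", header)
start, end = match.start(), match.end()

# Slicing with start and end recovers the matched text itself
print(header[start:end])
# Everything from end onwards is the text after "OS="
print(header[end:])
```

Using .end() instead of start + 3 avoids hard-coding the length of the search string.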
