The Wayback Machine - https://web.archive.org/web/20200906142032/https://github.com/kblin/ncbi-genome-download/issues/102
incomplete descended taxid output of gimme_taxa.py #102

Open
444thLiao opened this issue Nov 6, 2019 · 2 comments

Comments

444thLiao (Contributor) commented Nov 6, 2019

This may be a bug that affects anyone who wants to download a whole phylum or order.

According to the ete3 documentation, NCBITaxa needs the parameter intermediate_nodes=True in order to return the complete set of descendant taxids for a given taxid.

The ete3 API reference does not cover NCBITaxa, so we may need to check the ete3 source code.

In short, the fix would be to add this parameter to the call on line 99 of gimme_taxa.py.

jrjhealey (Contributor) commented Nov 6, 2019

Can you elaborate, perhaps with some examples?

My immediate thought is that intermediate node names are not necessary, as far as I'm aware: the assembly summary file is only aware of TaxIDs up to, I think, species level, so ngd cannot do anything with an order or phylum taxid. I'm happy to be proven wrong, though. As you say, it wouldn't be a difficult change, and I don't think it would alter the current output negatively; I'm just not sure what difference it will make without further testing. If you want to download a whole order, you just need to get all the descendant taxa of the taxid one tier above it.

The docs don't seem to show intermediate_nodes as necessary for getting all the descendants of a specific internal node though?

444thLiao (Contributor, Author) commented Nov 8, 2019

Yes, that was my initial thought as well.
But when I downloaded the genus Bradyrhizobium and then the order Rhizobiales, I found that some genome data can be missed.

First, downloading the genus Bradyrhizobium is easy, because the genus name appears in the organism name column of genbank_bacteria_assembly_summary.txt. That way I downloaded exactly 253 genomes.

Later, I wanted to download the whole order Rhizobiales, which must include all the genomes of the genus Bradyrhizobium. I ran gimme_taxa.py with the command below.

python3 ~/script/ncbi-genome-download/contrib/gimme_taxa.py -o ~/tmp/Rhizobiales_all.txt 356

This writes the descendant taxids to Rhizobiales_all.txt. But when I gave the result to a colleague, she found that it yielded only about 200 Bradyrhizobium genomes instead of the expected 253.

When I checked carefully, I found what I think causes this bug.
By NCBI's design, some genomes are placed directly under a genus or species taxid because their exact taxonomy is unresolved. If that species also has more specific taxids beneath it, the species taxid is an intermediate node, and the current gimme_taxa.py leaves it out.
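
To illustrate what I mean, here is a minimal sketch with a toy taxonomy. All taxids, assembly names, and helper functions in it are invented for illustration; it does not use ete3 or the real assembly summary, and it only mimics the leaves-only versus intermediate_nodes=True behaviour as I understand it.

```python
# Toy taxonomy: parent taxid -> child taxids (all IDs are made up).
CHILDREN = {
    1: [10],            # order -> genus
    10: [100, 101],     # genus -> two species
    100: [1000, 1001],  # species 100 has strains, so it is an intermediate node
}

# Toy assembly summary: (assembly name, taxid the genome is filed under).
ASSEMBLIES = [
    ("asm_A", 1000),
    ("asm_B", 1001),
    ("asm_C", 101),
    ("asm_D", 100),  # filed directly under the species taxid
    ("asm_E", 10),   # unresolved genome filed under the genus taxid
]


def descendants(taxid, intermediate_nodes=False):
    """Collect descendant taxids; leaves only unless intermediate_nodes is True."""
    found = set()
    stack = list(CHILDREN.get(taxid, []))
    while stack:
        node = stack.pop()
        kids = CHILDREN.get(node, [])
        if not kids or intermediate_nodes:
            found.add(node)
        stack.extend(kids)
    return found


def matching_assemblies(taxids):
    """Return the names of assemblies whose taxid is in the given set."""
    return sorted(name for name, taxid in ASSEMBLIES if taxid in taxids)


leaves_only = descendants(1)
with_intermediate = descendants(1, intermediate_nodes=True)

print(matching_assemblies(leaves_only))        # asm_D and asm_E are missing
print(matching_assemblies(with_intermediate))  # all five assemblies
```

In this toy case the leaves-only list misses the two genomes filed under the genus and species taxids, which is the same pattern I see with Bradyrhizobium.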
