incomplete descended taxid output of gimme_taxa.py #102

444thLiao · 2019-11-06T14:39:24Z

Could this be a bug for anyone who wants to download a whole phylum or order?

According to the ete3 docs, NCBITaxa needs the parameter intermediate_nodes=True to return the complete set of descendant taxids for a given taxid.

The ete3 API reference lacks a section on NCBITaxa, so we may need to check the ete3 source code.

In short, a parameter should be added at line 99 of gimme_taxa.py.

jrjhealey · 2019-11-06T15:21:08Z

Can you elaborate, perhaps with some examples?

My immediate thought is that intermediate node names are not necessary as far as I'm aware, as the assembly summary file is only aware of TaxIDs up to, I think, species level, so ngd cannot do anything with an order or phylum taxid etc. I'm happy to be proven wrong though. As you say, it wouldn't be a difficult change, and I think it shouldn't alter the current output negatively; I'm just not sure what difference it will make without testing further. If you want to download a whole order, you just need to get all the descendant taxa of the taxid one tier above it.

The docs don't seem to show intermediate_nodes as necessary for getting all the descendants of a specific internal node though?
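For reference, the two call forms under discussion can be sketched as below. This is only an illustration, not the code in gimme_taxa.py: the helper names `descendant_taxids` and `load_ncbi` are made up here, and the sketch assumes ete3's `NCBITaxa.get_descendant_taxa` accepts the `intermediate_nodes` keyword mentioned above.

```python
def descendant_taxids(ncbi, taxid, include_intermediate=True):
    """Return the descendant taxids of `taxid` as a sorted, de-duplicated list.

    `ncbi` is assumed to be an ete3 NCBITaxa instance (or any object with a
    compatible get_descendant_taxa method). This helper is hypothetical.
    """
    taxids = ncbi.get_descendant_taxa(
        taxid, intermediate_nodes=include_intermediate
    )
    return sorted(set(taxids))


def load_ncbi():
    # Imported lazily because ete3 builds a local copy of the NCBI
    # taxonomy database the first time NCBITaxa() is created.
    from ete3 import NCBITaxa
    return NCBITaxa()


# Example usage (needs ete3 and its taxonomy database):
# ncbi = load_ncbi()
# leaves_only = descendant_taxids(ncbi, 356, include_intermediate=False)
# everything = descendant_taxids(ncbi, 356, include_intermediate=True)
# print(len(leaves_only), len(everything))
```

Comparing the two list lengths for the same taxid would be one way to test whether the parameter changes the output at all.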

444thLiao · 2019-11-08T07:11:38Z

Yes, that was my earlier assumption too.
But when I downloaded the genus Bradyrhizobium and the order Rhizobiales, I found that some genome data can go missing.

First, I downloaded the genus Bradyrhizobium. This is easy, because for a genus the name appears in the organism name field of genbank_bacteria_assembly_summary.txt. I downloaded exactly 253 genomes.

Later, I wanted to download the whole order Rhizobiales, which must include all genomes of the genus Bradyrhizobium. I ran gimme_taxa.py with the command below.

python3 ~/script/ncbi-genome-download/contrib/gimme_taxa.py -o ~/tmp/Rhizobiales_all.txt 356

This writes the descendant taxids to Rhizobiales_all.txt. But when I gave the result to a colleague, she found that it yielded only about 200 Bradyrhizobium genomes instead of the expected 253.

On closer inspection, this seems to be what causes the bug.
By NCBI's design, some genomes are placed directly under a genus or species taxid because their exact taxonomy is unresolved. If that species also has more specific taxids beneath it, the species taxid is treated as an intermediate node, which the current gimme_taxa.py ends up omitting.
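The effect can be illustrated with a toy taxonomy. Everything in this sketch is made up for illustration (the taxids, the tree, and the genome list); it only shows how a leaf-only descendant list can miss genomes registered under an internal node.

```python
# Toy parent -> children map; these taxids are invented, not real NCBI ids.
CHILDREN = {
    1: [10],            # order
    10: [100, 101],     # genus with two species
    100: [1000, 1001],  # species that also has strain-level children
    101: [],            # species with no children (a leaf)
    1000: [],
    1001: [],
}

# Taxid each hypothetical genome is registered under. Some sit directly
# on the genus (10) or on the species with children (100).
GENOME_TAXIDS = [10, 100, 100, 101, 1000, 1001]


def descendants(taxid, include_intermediate):
    """Collect descendants of `taxid`, optionally keeping internal nodes."""
    found = []
    stack = list(CHILDREN.get(taxid, []))
    while stack:
        node = stack.pop()
        kids = CHILDREN.get(node, [])
        if include_intermediate or not kids:
            found.append(node)
        stack.extend(kids)
    return sorted(found)


def count_matches(taxids):
    """Count genomes whose taxid is in the given list."""
    wanted = set(taxids)
    return sum(1 for t in GENOME_TAXIDS if t in wanted)


leaves = descendants(1, include_intermediate=False)
everything = descendants(1, include_intermediate=True)
print(count_matches(leaves), "genomes matched with leaves only")
print(count_matches(everything), "genomes matched with intermediate nodes")
```

In this toy case the leaf-only list skips the genomes attached to taxids 10 and 100, which mirrors the gap between roughly 200 and 253 genomes described above.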


kblin / ncbi-genome-download
