NCBItax2lin

Convert the NCBI taxonomy dump into lineages. For example, the lineage for human (tax_id=9606) looks like this:

Columns: tax_id, superkingdom, phylum, class, order, family, genus, species, family1, forma, genus1, infraclass, infraorder, kingdom, no rank, no rank1, no rank10, no rank11, no rank12, no rank13, no rank14, no rank15, no rank16, no rank17, no rank18, no rank19, no rank2, no rank20, no rank21, no rank22, no rank3, no rank4, no rank5, no rank6, no rank7, no rank8, no rank9, parvorder, species group, species subgroup, species1, subclass, subfamily, subgenus, subkingdom, suborder, subphylum, subspecies, subtribe, superclass, superfamily, superorder, superorder1, superphylum, tribe, varietas

Values for 9606 (empty columns omitted): 9606, Eukaryota, Chordata, Mammalia, Primates, Hominidae, Homo, Homo sapiens, Simiiformes, Metazoa, cellular organisms, Opisthokonta, Dipnotetrapodomorpha, Tetrapoda, Amniota, Theria, Eutheria, Boreoeutheria, Eumetazoa, Bilateria, Deuterostomia, Vertebrata, Gnathostomata, Teleostomi, Euteleostomi, Sarcopterygii, Catarrhini, Homininae, Haplorrhini, Craniata, Hominoidea, Euarchontoglires

Install

ncbitax2lin supports Python 3.7, 3.8, and 3.9.

pip install -U ncbitax2lin

Generate lineages

First, download the taxonomy dump from NCBI:

wget -N ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir -p taxdump && tar zxf taxdump.tar.gz -C ./taxdump

Then run ncbitax2lin:

ncbitax2lin --nodes-file taxdump/nodes.dmp --names-file taxdump/names.dmp

By default, the generated lineages are saved to ncbi_lineages_[date_of_utcnow].csv.gz. The output path can be overridden with the --output option.
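
Once the file has been generated, a single lineage can be looked up with the Python standard library alone. The following is a minimal sketch, assuming the output is a gzip-compressed CSV with a header row whose first column is tax_id, as in the human example above. The function name find_lineage is hypothetical and not part of ncbitax2lin.

```python
import csv
import gzip


def find_lineage(csv_gz_path, tax_id):
    """Return the non-empty ranks for tax_id as a dict, or None if absent.

    Hypothetical helper; assumes a header row containing a "tax_id" column.
    """
    wanted = str(tax_id)
    # "rt" reads the gzip stream as text; newline="" is what the csv module expects.
    with gzip.open(csv_gz_path, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row["tax_id"] == wanted:
                # Most rank columns are empty for any given taxon, so drop them.
                return {rank: name for rank, name in row.items() if name}
    return None
```

Calling find_lineage with the path of the generated file and 9606 would return the human row as a dictionary keyed by rank. The file is scanned line by line, so memory use stays small, at the cost of one full pass per lookup.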

FAQ

Q: I have a large number of sequences with their corresponding NCBI accession numbers. How do I get their lineages?

A: First, map the accession numbers (GI is deprecated) to tax IDs using the nucl_*accession2taxid.gz files from ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/. Second, trace each sequence's whole lineage from its tax ID. The tax-ID-to-lineage mapping is what NCBItax2lin generates for you.
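
The two steps in this answer could be sketched as follows. This is an illustration under stated assumptions rather than part of NCBItax2lin: it assumes the accession2taxid files are tab-separated with a header row naming the columns accession, accession.version, taxid, and gi, and that the lineage file has a tax_id column. Both function names are hypothetical.

```python
import csv
import gzip


def accessions_to_taxids(accession2taxid_gz, accessions):
    """Step 1: stream an accession2taxid file and return {accession.version: taxid}."""
    wanted = set(accessions)
    found = {}
    with gzip.open(accession2taxid_gz, "rt", newline="", encoding="utf-8") as handle:
        # Assumed header: accession, accession.version, taxid, gi (tab-separated).
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["accession.version"] in wanted:
                found[row["accession.version"]] = row["taxid"]
                if len(found) == len(wanted):
                    break  # every requested accession has been resolved
    return found


def lineages_for_taxids(lineages_csv_gz, tax_ids):
    """Step 2: return {tax_id: lineage row} from the NCBItax2lin output file."""
    wanted = set(tax_ids)
    lineages = {}
    with gzip.open(lineages_csv_gz, "rt", newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row["tax_id"] in wanted:
                lineages[row["tax_id"]] = row
    return lineages
```

Passing the values of the first result to the second function gives the lineage for each sequence. Both files are streamed, which matters because the accession2taxid files are large.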

If you have any questions about this project, please feel free to create a new issue.

Note on taxdump.tar.gz.md5

It appears that NCBI periodically regenerates taxdump.tar.gz and taxdump.tar.gz.md5 even when the content is unchanged. I am not sure how the regeneration works, but taxdump.tar.gz.md5 will differ simply because of a different timestamp.
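
One way to tell whether two downloads really differ is to hash the files inside the archive instead of the archive itself. The sketch below is an illustration, not something NCBItax2lin provides: it assumes the differences are confined to archive metadata such as timestamps, and the function name taxdump_content_digest is hypothetical.

```python
import hashlib
import tarfile


def taxdump_content_digest(tar_gz_path):
    """Digest of member names and file contents, ignoring timestamps and other metadata."""
    per_file = {}
    with tarfile.open(tar_gz_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            file_hash = hashlib.sha256()
            extracted = archive.extractfile(member)
            # Read in 1 MiB chunks so large members such as names.dmp are not loaded at once.
            for chunk in iter(lambda: extracted.read(1 << 20), b""):
                file_hash.update(chunk)
            per_file[member.name] = file_hash.hexdigest()
    combined = hashlib.sha256()
    # Sort by name so the order of members in the archive does not affect the result.
    for name in sorted(per_file):
        combined.update(f"{name}\0{per_file[name]}\n".encode("utf-8"))
    return combined.hexdigest()
```

Two archives with the same digest hold the same files with the same bytes, even if their MD5 checksums differ.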

Development

Install dependencies

poetry shell
poetry install

Testing

make format
make all

Publish (only for administrator)

poetry version [minor/major etc.]
poetry publish --build