Home

Awesome

FamDB

Overview

FamDB is a modular HDF5-based export format and query tool developed for offline access to the Dfam database of transposable element and repetitive DNA families. FamDB stores family sequence models (profile HMMs, and consensus sequences), metadata including:

In addition, FamDB stores a subset of the NCBI Taxonomy relevant to the family taxa represented in the file, facilitating quick extraction of species/clade-specific family libraries. The query tool provides options for exporting search results in a variety of common formats including EMBL, FASTA, and HMMER HMM format. At this time FamDB is intended for use as a "read-only" data store by tools such as RepeatMasker as an alternative to unindexed EMBL or HMM files.

Installation/Setup

FamDB files follow a simple hierarchical structure based on the NCBI taxonomy tree. These files represent partitioned subsets of the Dfam database and contain data for related areas of the taxonomy tree.

Each installation comprises a required root file as well as a number of optional leaf files. The root file contains data for the higher levels of the taxonomy tree, data for less represented taxa (Fungi, Amoebas), and data for highly studied taxa (Mammals). The leaf files contain data for related taxa as shown in the table below.

The current partitions are:

NumberNameDescriptionRoot Taxon IDFile SizeRequired
0RootMammals, Microbes, Fungi, Jellies, & Sponges171Gb*
1ObtectomeraMoths & Butterflies104431125Gb
2EuteleosteomorphaBony Fish1489388118Gb
3SarcopterygiiReptiles, Amphibians, & Coelacanths828790Gb
4DipteraFlies714787Gb
5ViridiplantaePlants3309072Gb
6DeuterostomiaOther Fish & Echinoderms3351169Gb
7HymenopteraWasps, Bees, & Ants739963Gb
8EcdysozoaOther Arthropods & Roundworms1206794126Gb

All FamDB files follow the convention <export name>.<partition number>.h5 where the root partition is partition 0. All files from the same export should be kept in the same directory with no other exports present. The FamDB software will display a warning if files from different export or partitioning runs are present. The name of the directory containing the FamDB files is passed as an argument to famdb.py.

Dependencies

famdb.py

RepeatMasker includes a compatible version of famdb.py. This file should generally not be installed or upgraded manually.

FamDB can also be downloaded separately. At this time, only the file famdb.py is needed. The latest development version is located here: https://raw.githubusercontent.com/Dfam-consortium/FamDB/master/famdb.py

Usage

General usage is as follows:

famdb.py -i <directory> <command>

where <directory> is the folder holding the .h5 export files where <command> is one of info, names, lineage, families, or family.

info

Prints general information and statistics about the database, such as title, version, date, and count of consensus sequences and HMMs in the database.

names

Searches the taxonomy database for species names and prints all known names for any matches. The output is human-readable ("pretty") by default but can also be in JSON format. The JSON format is intended for parsing by scripts; the pretty format is too unstructured to parse reliably.

famdb.py -i ./dfam names [-f json] <term>

term can be a taxonomy identifier number or part of a species/clade name. In this example the FamDB files are stored in a directory called 'dfam' in the current working directory. RepeatMasker looks for them in its "Library/famdb" directory by default.

Exact matches are distinguished from non-exact matches. For example:

$ famdb.py -i ./dfam names rattus

Exact Matches
=============
10114 rat <Rattus> (common name), rats <Rattus> (common name), Rattus
(scientific name)

Non-exact Matches
=================
10115 Cape York rat (common name), mottle-tailed rat (genbank common name),
Rattus leucopus (scientific name)
10116 brown rat (common name), Buffalo rat (includes), laboratory rat
(includes), Norway rat (genbank common name), rat <Rattus norvegicus> (common
name), rats <Rattus norvegicus> (common name), Rattus norvegicus (scientific
name), Rattus PC12 clone IS (includes), Rattus sp. strain Wistar (includes),
Sprague-Dawley rat (includes), Wistar rats (includes), zitter rats (includes)
10117 black rat (genbank common name), house rat (common name), Rattus rattoides
<Rattus rattus> (synonym), Rattus rattoides (Pictet & Pictet, 1844) (authority),
Rattus rattus (scientific name), Rattus rattus (Linnaeus, 1758) (authority),
Rattus wroughtoni (synonym), Rattus wroughtoni Hinton, 1919 (authority), roof
rat (common name)
(...)

Other commands that take a taxonomy "term", including lineage and families, generally use the same search system as the names command. Most commands require a single exact match, or if there are no exact matches a single partial match. This requirement provides some leniency in name choice without allowing ambiguities.

If there are no matches for term, similar-sounding names will be suggested.

lineage

Prints the lineage tree for a species or clade with line drawing characters. The tree includes the number of families assigned to each clade. The options -a (ancestors) and/or -d (descendants) can be added to include ancestors, descendants, or both, as desired.

famdb.py -i ./dfam lineage [-a] [-d] [-f semicolon|totals] <term>

The semicolon-delimited format does not include the tree drawing and is more suitable for parsing. The totals format prints the number of ancestral and of lineage-specific repeats known for the given species. In the example below the count of families specifically assigned to each taxon is shown in brackets. The number in parentheses indicates which partition of FamDB contains the data for that taxon and its descendants.

$ famdb.py -i ./dfam lineage -ad humans

1 root(0) [0]
└─131567 cellular organisms(0) [0]
  └─2759 Eukaryota(0) [0]
    └─33154 Opisthokonta(0) [0]
      └─33208 Metazoa(0) [5]
        └─6072 Eumetazoa(0) [0]
          └─33213 Bilateria(0) [0]
            └─33511 Deuterostomia(0) [0]
              └─7711 Chordata(0) [0]
                └─89593 Craniata <chordates>(0) [0]
                  └─7742 Vertebrata <vertebrates>(0) [67]
                    └─7776 Gnathostomata <vertebrates>(0) [0]
                      └─117570 Teleostomi(0) [0]
                        └─117571 Euteleostomi(0) [1]
                          └─8287 Sarcopterygii(0) [0]
                            └─1338369 Dipnotetrapodomorpha(0) [0]
                              └─32523 Tetrapoda(0) [19]
                                └─32524 Amniota(0) [100]
                                  └─40674 Mammalia(0) [68]
                                    └─32525 Theria <mammals>(0) [69]
                                      └─9347 Eutheria(0) [387]
                                        └─1437010 Boreoeutheria(0) [40]
                                          └─314146 Euarchontoglires(0) [44]
                                            └─9443 Primates(0) [140]
                                              └─376913 Haplorrhini(0) [199]
                                                └─314293 Simiiformes(0) [56]
                                                  └─9526 Catarrhini(0) [104]
                                                    └─314295 Hominoidea(0) [23]
                                                      └─9604 Hominidae(0) [6]
                                                        └─207598 Homininae(0) [14]
                                                          └─9605 Homo(0) [0]
                                                            └─9606 Homo sapiens(0) [52]

family

Prints a single family given by the family accession.

famdb.py -i ./dfam family [-f <format>] <acc>

There are many formats to choose from:

$ famdb.py -i ./dfam family -f fasta_name DF0000001

>MIR @Mammalia [S:40,60,65]
ACAGTATAGCATAGTGGTTAAGAGCACGGGCTCTGGAGCCAGACTGCCTGGGTTCGAATC
CCGGCTCTGCCACTTACTAGCTGTGTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCT
CAGTTTCCTCATCTGTAAAATGGGGATAATAATAGTACCTACCTCATAGGGTTGTTGTGA
GGATTAAATGAGTTAATACATGTAAAGCGCTTAGAACAGTGCCTGGCACATAGTAAGCGC
TCAATAAATGTTAGCTATTATT

families

The families command takes a taxonomy term and prints all families assigned to that clade (optionally including ancestors and/or descendants) with optional additional filters.

famdb.py -i ./dfam families [-a] [-d] [--stage <st>] [--class <cl>] [--name <name>] [--curated] [-f <format>] [--add-reverse-complement] [--include-class-in-name] <term>

[-a], [-d] include ancestors/descendants as with lineage. The formats for [-f <format>] are the same as for family.

Filters:

Output options:

$ famdb.py -i ./dfam families -f embl_seq -a human