Taxonomic reference data

The assignTaxonomy and assignSpecies functions require appropriately formatted fasta files describing the set of taxonomically assigned sequences to use as a training dataset. We provide appropriately formatted versions of several popular taxonomic databases, and describe the dada2-specific format for those who wish to use a custom database.

DADA2-formatted reference databases

We maintain reference fastas for the three most common 16S databases: Silva, RDP and GreenGenes2. The dada2 package also recognizes and parses the General Fasta releases of the UNITE project for ITS taxonomic assignment. Formatted versions of other databases can be “contributed” and will be made available through this page if they are referenceable by DOI (e.g. deposited at Zenodo or Figshare).

Please note that the files provided here are just reformattings of these taxonomic databases. If using these files for taxonomic assignment, the source database should also be cited.

Maintained:

Silva version 138.2 (version 138.1, version 132, version 128, version 123)
NOTE: As of Silva version 138 the official DADA2-formatted reference fastas are optimized for classification of Bacteria and Archaea, and are not suitable for classifying Eukaryotes. Formatted versions of Silva suitable for Fungi are in the Contributed section below.
RDP Trainset 19 (trainset 18, trainset 16, trainset 14)
GreenGenes2 release 2024.09 (version 13.8 - outdated)
UNITE (use the General Fasta releases, “All eukaryotes”)

Contributed:

GTDB Version 202: Genome Taxonomy Database (Version 86 for assignTaxonomy and assignSpecies)
SILVA v138.2 18S reference for Fungi
NOTE: Use this instead of the Maintained version of the Silva reference if working with fungi.
SILVA v138.2 26S reference for Fungi
NOTE: Use this instead of the Maintained version of the Silva reference if working with fungi.
RefSeq + RDP (NCBI RefSeq 16S rRNA database supplemented by RDP)
- Reference files formatted for assignTaxonomy
- Reference files formatted for assignSpecies
HitDB version 1 (Human InTestinal 16S rRNA)
Human Oral Microbiome Database: HOMD
MiDAS: Field Guide to the Microbes of Activated Sludge and Anaerobic Digesters
MIDORI Reference 2 (for taxonomic assignments of Eukaryota mitochondrial DNA sequences)
RDP fungi LSU trainset 11
Silva Eukaryotic 18S, v132 & v128
nifH ARB, version 1
PR2 version 5.0.0+. SEE NOTE BELOW.

Note: PR2 has different taxLevels than the dada2 default. When assigning taxonomy against PR2, use the following:

assignTaxonomy(..., taxLevels = c("Domain","Supergroup","Division","Subdivision","Class","Order","Family","Genus","Species"))
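The taxLevels vector above has nine entries, so it can help to confirm how many levels the id lines of a reference fasta actually carry before choosing the names. The following is a minimal Python sketch of such a check, not part of dada2; the function name rank_depths and the placeholder headers are hypothetical, and it assumes the semicolon-delimited id-line format described under "Formatting custom databases".

```python
from collections import Counter


def rank_depths(fasta_lines):
    """Count how many taxonomic levels each id line carries.

    Assumes id lines of the form '>Level1;Level2;...;' (hypothetical helper,
    not a dada2 function).
    """
    depths = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            # Drop the leading ">" and skip the empty field after the trailing ";".
            ranks = [r for r in line[1:].strip().split(";") if r]
            depths[len(ranks)] += 1
    return depths


# Placeholder headers: one assigned to nine levels, one to eight.
example = [
    ">D;Sg;Dv;Sd;C;O;F;G;S;",
    "ACGT",
    ">D;Sg;Dv;Sd;C;O;F;G;",
    "ACGT",
]
depths = rank_depths(example)
print("deepest assignment:", max(depths), "levels")
```

If the deepest assignment differs from the length of the taxLevels vector, the level names probably need adjusting for that database.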

Many thanks to the folks at RDP, Silva, GreenGenes, UNITE, GTDB, PR2 and others for making these amazing reference databases available to the community. We created the dada2-compatible training fastas from the Silva NR99 and taxonomy files, the RDP trainset 18 database, and the GreenGenes 13.8 OTUs clustered at 97%.

The scripts to parse and produce the officially maintained training fastas are included as private functions within the dada2 R package. Inspect taxonomy.R to see the code. Currently, training fastas for assignSpecies are available only for the Silva and RDP databases.

Formatting custom databases

Custom databases can be used with assignTaxonomy and assignSpecies provided they are in the dada2-compatible training fasta format. We thank the many contributors who have created custom databases suitable for use with DADA2, several of which are linked above.

assignTaxonomy(...) expects a training fasta file (or compressed fasta file) in which the taxonomy corresponding to each sequence is encoded in the id line in the following fashion (the second sequence is not assigned down to level 6):

>Level1;Level2;Level3;Level4;Level5;Level6;
ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
>Level1;Level2;Level3;Level4;Level5;
CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
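
As an illustration of the id-line format above, here is a minimal Python sketch that builds such records from a list of ranks. The helper names taxonomy_header and format_records are hypothetical and not part of dada2. The sketch assumes that ranks are ordered from highest to lowest level, and that a missing rank (None or an empty string) truncates the assignment at that point, as in the second example sequence.

```python
def taxonomy_header(ranks):
    """Build a '>Level1;Level2;...;' id line from an ordered list of ranks.

    Stops at the first missing rank, so a partially assigned sequence gets a
    shorter id line. Hypothetical helper, not a dada2 function.
    """
    kept = []
    for rank in ranks:
        if not rank:
            break
        if ";" in rank or "\n" in rank:
            raise ValueError(f"rank contains a delimiter character: {rank!r}")
        kept.append(rank)
    if not kept:
        raise ValueError("at least one taxonomic level is required")
    return ">" + ";".join(kept) + ";"


def format_records(records):
    """Render (ranks, sequence) pairs as fasta text, one sequence per line."""
    lines = []
    for ranks, sequence in records:
        lines.append(taxonomy_header(ranks))
        lines.append(sequence.upper())
    return "\n".join(lines) + "\n"


# Placeholder ranks and short placeholder sequences, for illustration only.
records = [
    (["Level1", "Level2", "Level3", "Level4", "Level5", "Level6"], "ACCTAGAAAG"),
    (["Level1", "Level2", "Level3", "Level4", "Level5", None], "CGCTAGAAAG"),
]
fasta_text = format_records(records)
print(fasta_text, end="")
```

The resulting text can be written to a file and passed to assignTaxonomy as the training fasta.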

assignSpecies(...) expects the training data to be provided in the form of a fasta file (or compressed fasta file), with the id line formatted as follows:

>ID Genus species
ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
>ID Genus species
CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
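
The species-level id line can be produced the same way. Below is a minimal Python sketch; the helper name species_header and the example identifier are hypothetical. It assumes the three fields are separated by single spaces and that none of them may contain whitespace, since extra whitespace would make the fields ambiguous.

```python
def species_header(seq_id, genus, species):
    """Build a '>ID Genus species' id line.

    Rejects empty fields and fields containing whitespace, on the assumption
    that the id line is split on whitespace when read. Hypothetical helper,
    not a dada2 function.
    """
    fields = (("seq_id", seq_id), ("genus", genus), ("species", species))
    for name, value in fields:
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(
                f"{name} must be non-empty and contain no whitespace: {value!r}"
            )
    return f">{seq_id} {genus} {species}"


# Hypothetical identifier; the genus and species are only an example.
header = species_header("seq0001", "Escherichia", "coli")
print(header)
```

Each id line is followed by its sequence on the next line, as in the example records above.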

Maintained by Benjamin Callahan (benjamin DOT j DOT callahan AT gmail DOT com)
Documentation License: CC-BY 4.0