The assignTaxonomy and assignSpecies functions require appropriately formatted fasta files describing the set of taxonomically assigned sequences to use as a training dataset. We provide appropriately formatted versions of several popular taxonomic databases, and describe the dada2-specific format for those who wish to use a custom database.

DADA2-formatted reference databases

We maintain reference fastas for the three most common 16S databases: Silva, RDP and GreenGenes2. The dada2 package recognizes and parses the General Fasta releases of the UNITE project for ITS taxonomic assignment. Formatted versions of other databases can be “contributed” and will be made available through this page if referenceable by DOI (e.g., deposited at Zenodo or Figshare).

Please note that the files provided here are just reformattings of these taxonomic databases. If using these files for taxonomic assignment, the source database should also be cited.


Maintained:

Contributed:

Note: PR2 has different taxLevels than the dada2 default. When assigning taxonomy against PR2, use the following: assignTaxonomy(..., taxLevels = c("Domain","Supergroup","Division","Subdivision", "Class","Order","Family","Genus","Species"))
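
A quick sanity check before running the assignment is to confirm that the number of names in taxLevels matches the deepest id line in the training fasta. The sketch below does this in base R. Note that max_header_depth is a hypothetical helper, not a dada2 function, and the toy file only stands in for a real PR2 training fasta.

```r
# Taxonomic levels for PR2, copied from the note above.
pr2_levels <- c("Domain", "Supergroup", "Division", "Subdivision",
                "Class", "Order", "Family", "Genus", "Species")

# Hypothetical helper (not part of dada2): the largest number of
# ';'-separated levels found on any id line of a training fasta.
max_header_depth <- function(path) {
  lines <- readLines(path)
  headers <- lines[startsWith(lines, ">")]
  if (length(headers) == 0) stop("no id lines found in ", path)
  max(lengths(strsplit(sub("^>", "", headers), ";", fixed = TRUE)))
}

# Toy stand-in for a PR2 training fasta; labels and sequences are made up.
toy_ref <- tempfile(fileext = ".fa")
writeLines(c(">D;Sg;Dv;Sd;C;O;F;G;Sp;", "ACGTACGTACGT",
             ">D;Sg;Dv;Sd;C;O;F;", "TGCATGCATGCA"), toy_ref)

# TRUE when taxLevels has one name per level of the deepest id line.
depth_ok <- max_header_depth(toy_ref) == length(pr2_levels)
depth_ok
```

With dada2 loaded and a real PR2 file, the same vector would be passed as assignTaxonomy(seqs, refFasta, taxLevels = pr2_levels), where seqs and refFasta are placeholders for your query sequences and the path to the training fasta.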


Many thanks to the folks at RDP, Silva, GreenGenes, UNITE, GTDB, PR2 and others for making these amazing reference databases available to the community. We created the dada2-compatible training fastas from the Silva NR99 and taxonomy files, the RDP trainset 18 database, and the GreenGenes 13.8 OTUs clustered at 97%.

The scripts that parse the source databases and produce the officially maintained training fastas are included as private functions within the dada2 R package. Inspect taxonomy.R to see the code. Currently, training fastas for assignSpecies are available only for the Silva and RDP databases.

Formatting custom databases

Custom databases can be used with assignTaxonomy and assignSpecies provided they are in the dada2-compatible training fasta format. We thank the many contributors who have created custom databases suitable for use with DADA2, several of which are linked above.

assignTaxonomy(...) expects a training fasta file (or compressed fasta file) in which the taxonomy corresponding to each sequence is encoded in the id line in the following fashion (the second sequence is not assigned down to level 6):

>Level1;Level2;Level3;Level4;Level5;Level6;
ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
>Level1;Level2;Level3;Level4;Level5;
CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
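
As an illustration of this format, the following base R sketch builds id lines from a vector of ranks and writes a small training fasta. Both helpers (make_taxonomy_header and write_training_fasta) are hypothetical and not part of dada2, and the records are made up. The sketch assumes that unassigned lower levels are simply omitted, as in the second example sequence above.

```r
# Hypothetical helper (not part of dada2): build an assignTaxonomy-style id
# line from ranks ordered from the highest level to the lowest.
make_taxonomy_header <- function(ranks) {
  # Truncate at the first unassigned level (NA or empty string) so that the
  # remaining levels keep their positions.
  unassigned <- is.na(ranks) | !nzchar(ranks)
  if (any(unassigned)) ranks <- ranks[seq_len(which(unassigned)[1] - 1)]
  if (length(ranks) == 0) stop("at least the first level must be assigned")
  # ';' is the level separator, so it cannot appear inside a rank name.
  if (any(grepl(";", ranks, fixed = TRUE))) stop("rank names must not contain ';'")
  paste0(">", paste0(ranks, ";", collapse = ""))
}

# Hypothetical helper: write id lines and sequences as alternating lines.
write_training_fasta <- function(taxa, seqs, path) {
  stopifnot(length(taxa) == length(seqs))
  headers <- vapply(taxa, make_taxonomy_header, character(1))
  writeLines(as.vector(rbind(headers, seqs)), path)
}

# Made-up example records; the second is not assigned at level 6.
taxa <- list(
  c("Level1", "Level2", "Level3", "Level4", "Level5", "Level6"),
  c("Level1", "Level2", "Level3", "Level4", "Level5", NA)
)
seqs <- c("ACCTAGAAAGTCGTAGATCG", "CGCTAGAAAGTCGTAGAAGG")

out <- tempfile(fileext = ".fa")
write_training_fasta(taxa, seqs, out)
fasta_lines <- readLines(out)
```

A file written this way could then be supplied as the refFasta argument of assignTaxonomy. A two-sequence toy file like this one is far too small to give meaningful classifications.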

assignSpecies(...) expects the training data to be provided in the form of a fasta file (or compressed fasta file), with the id line formatted as follows:

>ID Genus species
ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
>ID Genus species
CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
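
The id line for assignSpecies can be built and checked in the same spirit. In this base R sketch, make_species_header and is_species_header are hypothetical helpers rather than dada2 functions. It assumes that the three fields are separated by single spaces and contain no whitespace themselves. The identifiers seq1 and seq2 are made up.

```r
# Hypothetical helper (not part of dada2): build an assignSpecies-style id line.
# Assumption: each field is a single non-empty token without whitespace,
# because the fields are separated by single spaces.
make_species_header <- function(id, genus, species) {
  fields <- c(id, genus, species)
  if (length(fields) != 3 || anyNA(fields) || any(!nzchar(fields)) ||
      any(grepl("[[:space:]]", fields))) {
    stop("id, genus and species must each be a single non-empty token")
  }
  paste0(">", id, " ", genus, " ", species)
}

# Hypothetical check: TRUE for lines that look like ">ID Genus species".
is_species_header <- function(lines) {
  grepl("^>[^[:space:]]+ [^[:space:]]+ [^[:space:]]+$", lines)
}

header <- make_species_header("seq1", "Escherichia", "coli")

# One well-formed id line, one id line missing the species, one sequence line.
checks <- is_species_header(c(header, ">seq2 Escherichia", "ACCTAGAAAGTCG"))
```

Running is_species_header over the id lines of a custom file before passing it to assignSpecies is one inexpensive way to catch records with a missing genus or species field.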

Maintained by Benjamin Callahan (benjamin DOT j DOT callahan AT gmail DOT com)
Documentation License: CC-BY 4.0