The assignTaxonomy and assignSpecies functions require appropriately formatted fasta files describing the set of taxonomically assigned sequences to use as a training dataset. We provide appropriately formatted versions of several popular taxonomic databases, and describe the dada2-specific format for those who wish to use a custom database.

## DADA2-formatted reference databases

We maintain reference fastas for the three most common 16S databases: Silva, RDP and GreenGenes. As of version 1.3.3, the dada2 package also recognizes and parses the General Fasta releases of the UNITE project for ITS taxonomic assignment. Formatted versions of other databases can be contributed, and will be made available through this page if they are referenceable by DOI (e.g. deposited at Zenodo or Figshare).

Please note that the files provided here are just derivative reformattings of these taxonomic databases. If using these files for taxonomic assignment, the source database should also be cited.

Maintained:

Contributed:

Note: PR2 has different taxLevels than the dada2 default. When assigning taxonomy against PR2, use the following: assignTaxonomy(..., taxLevels = c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species"))
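
The note above can be written out as a runnable call. This is a minimal sketch: the rank names are copied from the note, while `seqs` and `pr2_fasta` are hypothetical placeholders for your own query sequences and your local copy of the PR2 training fasta. The call is guarded so that the snippet does nothing when dada2 or the file is not available.

```r
# PR2 rank names, copied from the note above.
pr2_levels <- c("Kingdom", "Supergroup", "Division", "Class",
                "Order", "Family", "Genus", "Species")

# Hypothetical inputs: replace with your own sequences and file path.
seqs <- c("ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTG")
pr2_fasta <- "path/to/pr2_training.fasta.gz"  # placeholder, not a real file name

# Only run the assignment if dada2 is installed and the reference file exists.
if (requireNamespace("dada2", quietly = TRUE) && file.exists(pr2_fasta)) {
  taxa <- dada2::assignTaxonomy(seqs, pr2_fasta, taxLevels = pr2_levels)
  print(taxa)
}
```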

In addition to thanking the folks at RDP, Silva, GreenGenes and UNITE for making these datasets available, we also want to thank Pat Schloss and the mothur team for their work compiling the Silva data. We created the dada2-compatible training fastas from the mothur-compatible Silva training data (described here, and license here), the RDP trainset 16 and release 11.5 database, and the GreenGenes 13.8 OTUs clustered at 97%.

As of version 1.5.1, the scripts to parse and produce the dada2-formatted training fastas are included as private functions within the dada2 R package. Inspect taxonomy.R to see the code. Currently, training fastas for assignSpecies are only available for the Silva and RDP databases.

## Formatting custom databases

Custom databases can be used with assignTaxonomy and assignSpecies provided they are in the dada2-compatible training fasta format.

assignTaxonomy(...) expects a training fasta file (or compressed fasta file) in which the taxonomy corresponding to each sequence is encoded in the id line in the following fashion (the second sequence is not assigned down to level 6):

    >Level1;Level2;Level3;Level4;Level5;Level6;
    ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
    >Level1;Level2;Level3;Level4;Level5;
    CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
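
For those building a custom database, a file in this format can be produced with a few lines of base R. The sketch below is illustrative only and is not the code dada2 itself uses: `make_taxonomy_header` and `write_taxonomy_fasta` are hypothetical helper names. It assumes each sequence's taxonomy is available as a character vector ordered from the highest rank to the lowest, with `NA` (or an empty string) marking the point where the assignment stops.

```r
# Build one id line, e.g. ">Level1;Level2;Level3;", from a vector of ranks.
# Ranks from the first NA or empty entry onward are dropped.
make_taxonomy_header <- function(ranks) {
  unassigned <- is.na(ranks) | ranks == ""
  if (any(unassigned)) {
    ranks <- ranks[seq_len(which(unassigned)[1] - 1)]
  }
  # Semicolons delimit levels, so they must not appear inside a rank name.
  stopifnot(length(ranks) > 0, !any(grepl(";", ranks, fixed = TRUE)))
  paste0(">", paste(ranks, collapse = ";"), ";")
}

# Write id lines and sequences to a fasta file, one sequence per line.
write_taxonomy_fasta <- function(taxonomy, seqs, path) {
  stopifnot(length(taxonomy) == length(seqs))
  headers <- vapply(taxonomy, make_taxonomy_header, character(1))
  # Interleave: header 1, sequence 1, header 2, sequence 2, ...
  writeLines(as.vector(rbind(headers, seqs)), path)
}

# Reproduce the two-record example shown above.
taxonomy <- list(
  c("Level1", "Level2", "Level3", "Level4", "Level5", "Level6"),
  c("Level1", "Level2", "Level3", "Level4", "Level5", NA)
)
seqs <- c(
  "ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC",
  "CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC"
)
out <- tempfile(fileext = ".fasta")
write_taxonomy_fasta(taxonomy, seqs, out)
taxonomy_lines <- readLines(out)
```

The resulting file path can then be supplied to assignTaxonomy as the training fasta.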

assignSpecies(...) expects the training data to be provided in the form of a fasta file (or compressed fasta file), with the id line formatted as follows:

    >ID Genus species
    ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
    >ID Genus species
    CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
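
A similar base R sketch can compose and sanity-check id lines for an assignSpecies training file. The helper names `make_species_header` and `check_species_headers` are hypothetical, as are the placeholder identifiers and names in the example. The check assumes that an id line consists of exactly the three whitespace-separated fields shown above; that is a deliberately strict reading of the format and may reject lines that dada2 itself would accept.

```r
# Compose id lines of the form ">ID Genus species" (vectorized over inputs).
make_species_header <- function(id, genus, species) {
  stopifnot(length(id) == length(genus), length(id) == length(species))
  fields <- c(id, genus, species)
  # Fields are space-delimited, so whitespace inside a field would shift them.
  stopifnot(!anyNA(fields), all(nzchar(fields)), !any(grepl("\\s", fields)))
  paste0(">", id, " ", genus, " ", species)
}

# Return the id lines of a fasta file that do not have exactly three fields.
check_species_headers <- function(path) {
  lines <- readLines(path)
  headers <- lines[startsWith(lines, ">")]
  n_fields <- lengths(strsplit(sub("^>", "", headers), "\\s+"))
  headers[n_fields != 3]
}

# Example with placeholder ids and names, written to a temporary file.
headers <- make_species_header(
  id      = c("ID1", "ID2"),
  genus   = c("GenusA", "GenusB"),
  species = c("speciesA", "speciesB")
)
seqs <- c(
  "ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC",
  "CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC"
)
species_fasta <- tempfile(fileext = ".fasta")
writeLines(as.vector(rbind(headers, seqs)), species_fasta)

# An empty result means every id line has the expected three fields.
malformed <- check_species_headers(species_fasta)
```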

Maintained by Benjamin Callahan (benjamin DOT j DOT callahan AT gmail DOT com)
Documentation License: CC-BY 4.0