The assignTaxonomy and assignSpecies functions require appropriately formatted fasta files describing the set of taxonomically assigned sequences to use as a training dataset. We provide appropriately formatted versions of several popular taxonomic databases, and describe the dada2-specific format for those who wish to use a custom database.
We maintain reference fastas for the three most common 16S databases: Silva, RDP and GreenGenes. The dada2 package recognizes and parses the General Fasta releases of the UNITE project for ITS taxonomic assignment. Formatted versions of other databases can be “contributed” and will be made available through this page if they are referenceable by DOI (e.g. deposited at Zenodo or Figshare).
Please note that the files provided here are just derivative reformattings of these taxonomic databases. If using these files for taxonomic assignment, the source database should also be cited.
Deprecated: GreenGenes version 13.8 (the source GreenGenes database is no longer being maintained)
HitDB version 1 (Human InTestinal 16S rRNA)
PR2 version 4.7.2+. SEE NOTE BELOW.
Note: PR2 has different taxLevels than the dada2 default. When assigning taxonomy against PR2, use the following:
assignTaxonomy(..., taxLevels = c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species"))
Many thanks to the folks at RDP, Silva, GreenGenes, UNITE, GTDB, PR2 and others for making these amazing reference databases available to the community. We created the dada2-compatible training fastas from the Silva NR99 and taxonomy files, the RDP trainset 16 and release 11.5 database, and the GreenGenes 13.8 OTUs clustered at 97%.
The scripts to parse and produce the officially maintained training fastas are included as private functions within the dada2 R package. Inspect taxonomy.R to see the code. Currently, training fastas for assignSpecies are only available for the Silva and RDP databases.
Custom databases can be used with assignTaxonomy and assignSpecies provided they are in the dada2-compatible training fasta format. We thank the many contributors who have created custom databases suitable for use with DADA2, several of which are linked above.
assignTaxonomy(...) expects a training fasta file (or compressed fasta file) in which the taxonomy corresponding to each sequence is encoded in the id line in the following fashion (the second sequence is not assigned down to level 6):
>Level1;Level2;Level3;Level4;Level5;Level6;
ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
>Level1;Level2;Level3;Level4;Level5;
CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
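For those building a custom database, the id-line format above can be generated from a table of taxonomic ranks. The following is a minimal base-R sketch, not the code dada2 itself uses: the data frame `refs`, its column names, the example taxa and the helper `make_header` are all hypothetical, and the sequences are shortened placeholders.

```r
# Hypothetical input: one row per reference sequence, NA where a rank is unassigned.
refs <- data.frame(
  Kingdom = c("Bacteria", "Bacteria"),
  Phylum  = c("Firmicutes", "Proteobacteria"),
  Class   = c("Bacilli", NA),
  seq     = c("ACCTAGAAAGTCGTAGATCG", "CGCTAGAAAGTCGTAGAAGG"),
  stringsAsFactors = FALSE
)
rank_cols <- c("Kingdom", "Phylum", "Class")

# Hypothetical helper: build ">Level1;Level2;...;" from one row of ranks,
# stopping at the first missing (NA or empty) rank.
make_header <- function(ranks) {
  ranks <- unlist(ranks, use.names = FALSE)
  missing_rank <- is.na(ranks) | ranks == ""
  n_assigned <- match(TRUE, missing_rank, nomatch = length(ranks) + 1) - 1
  stopifnot(n_assigned >= 1)  # at least the top level must be assigned
  paste0(">", paste0(ranks[seq_len(n_assigned)], ";", collapse = ""))
}

headers <- vapply(
  seq_len(nrow(refs)),
  function(i) make_header(refs[i, rank_cols]),
  character(1)
)

# Interleave id lines and sequences: header1, seq1, header2, seq2, ...
fasta_lines <- as.vector(rbind(headers, refs$seq))

# Write to a temporary file; in practice this would be the custom training fasta.
out_path <- tempfile(fileext = ".fa")
writeLines(fasta_lines, out_path)
```

Here the second record is only assigned down to the second level, so its id line ends after two semicolon-terminated ranks, mirroring the shorter second record in the format example.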
assignSpecies(...) expects the training data to be provided in the form of a fasta file (or compressed fasta file), with the id line formatted as follows:
>ID Genus species
ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
>ID Genus species
CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
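Before passing a custom species-level fasta to assignSpecies, it can help to check that every id line carries the three expected fields. The sketch below uses base R only and assumes the fields are separated by whitespace, as the format above suggests. The helper `parse_species_headers` and the example records are hypothetical and are not part of dada2.

```r
# Hypothetical example records in the ">ID Genus species" format.
fasta_lines <- c(
  ">ID1 Genus1 species1",
  "ACCTAGAAAGTCGTAGATCG",
  ">ID2 Genus2 species2",
  "CGCTAGAAAGTCGTAGAAGG"
)

# Hypothetical helper: pull ID, genus and species out of each id line,
# stopping with an error if any id line has fewer than three fields.
parse_species_headers <- function(lines) {
  headers <- lines[startsWith(lines, ">")]
  fields <- strsplit(sub("^>", "", headers), "\\s+")
  ok <- lengths(fields) >= 3
  if (!all(ok)) {
    stop("Malformed id line(s): ", paste(headers[!ok], collapse = " | "))
  }
  data.frame(
    id      = vapply(fields, `[`, character(1), 1),
    genus   = vapply(fields, `[`, character(1), 2),
    species = vapply(fields, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_species_headers(fasta_lines)
```

For a real file, the same check could be run on `readLines()` output (which also handles gzipped files when given a `gzfile()` connection).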