The assignTaxonomy and assignSpecies functions require appropriately formatted fasta files describing the set
of taxonomically assigned sequences to use as a training dataset. We
provide appropriately formatted versions of several popular taxonomic
databases, and describe the dada2-specific format for those who wish to
use a custom database.
We maintain reference fastas for the three most common 16S databases: Silva, RDP and GreenGenes2. The dada2 package recognizes and parses the General Fasta releases of the UNITE project for ITS taxonomic assignment. Formatted versions of other databases can be “contributed” and will be made available through this page if they are referenceable by DOI (e.g., deposited at Zenodo or Figshare).
Please note that the files provided here are just reformattings of these taxonomic databases. If using these files for taxonomic assignment, the source database should also be cited.
Maintained:
Silva version 138.2 (version 138.1, version 132, version 128, version 123)
Contributed:
GTDB Version 202: Genome Taxonomy Database (More info on GTDB), with files for assignTaxonomy and assignSpecies
RefSeq + RDP (NCBI RefSeq 16S rRNA database supplemented by RDP)
HitDB version 1 (Human InTestinal 16S rRNA)
MiDAS: Field Guide to the Microbes of Activated Sludge and Anaerobic Digesters
MIDORI Reference 2 (for taxonomic assignments of Eukaryota mitochondrial DNA sequences)
PR2 version 5.0.0+. SEE NOTE BELOW.
Note: PR2 has different taxLevels than the dada2 default. When assigning taxonomy against PR2, use the following:
assignTaxonomy(..., taxLevels = c("Domain","Supergroup","Division","Subdivision", "Class","Order","Family","Genus","Species"))
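Before running the call above, it can help to check that the number of levels in the training fasta id lines does not exceed the number of names given in taxLevels. The following is a minimal base R sketch, assuming each semicolon-delimited field of an id line corresponds, in order, to one entry of taxLevels. The helper count_levels and the placeholder id line are hypothetical and are not part of dada2 or PR2.

```r
# PR2 level names, copied from the call above.
pr2_levels <- c("Domain", "Supergroup", "Division", "Subdivision",
                "Class", "Order", "Family", "Genus", "Species")

# Hypothetical helper (not part of dada2): count the semicolon-delimited
# levels in a training fasta id line.
count_levels <- function(header) {
  body <- sub(";$", "", sub("^>", "", header))  # drop leading ">" and trailing ";"
  length(strsplit(body, ";", fixed = TRUE)[[1]])
}

# Placeholder id line with nine levels; real PR2 id lines carry taxon names.
example_header <- paste0(">", paste(paste0("Level", 1:9), collapse = ";"), ";")

n_levels <- count_levels(example_header)
levels_match <- n_levels <= length(pr2_levels)  # TRUE when every level has a name
```

In practice the same check could be applied to the id lines read from the downloaded PR2 file rather than to a placeholder.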
Many thanks to the folks at RDP, Silva, GreenGenes, UNITE, GTDB, PR2 and others for making these amazing reference databases available to the community. We created the dada2-compatible training fastas from the Silva NR99 and taxonomy files, the RDP trainset 18 database, and the GreenGenes 13.8 OTUs clustered at 97%.
The scripts to parse and produce the officially maintained training
fastas are included as private functions within the dada2 R package.
Inspect taxonomy.R to see the code. Currently, training fastas for assignSpecies are only available for the Silva and RDP databases.
Custom databases can be used with assignTaxonomy and assignSpecies provided they are in the dada2-compatible
training fasta format. We thank the many contributors who have created
custom databases suitable for use with DADA2, several of which are
linked above.
assignTaxonomy(...) expects a training fasta file (or
compressed fasta file) in which the taxonomy corresponding to each
sequence is encoded in the id line in the following fashion (the second
sequence is not assigned down to level 6):
>Level1;Level2;Level3;Level4;Level5;Level6;
ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
>Level1;Level2;Level3;Level4;Level5;
CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
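For those building a custom database, the id line format above can be produced and checked with a few lines of base R. This is only a sketch: make_tax_header and parse_tax_header are hypothetical helpers, not dada2 functions, and the sequences are shortened placeholders.

```r
# Hypothetical helper (not part of dada2): build an id line from level names,
# skipping levels that are unassigned (NA or empty).
make_tax_header <- function(ranks) {
  ranks <- ranks[!is.na(ranks) & nzchar(ranks)]
  paste0(">", paste(ranks, collapse = ";"), ";")
}

# Hypothetical helper: split an id line back into its levels.
parse_tax_header <- function(header) {
  body <- sub(";$", "", sub("^>", "", header))  # drop leading ">" and trailing ";"
  strsplit(body, ";", fixed = TRUE)[[1]]
}

h1 <- make_tax_header(paste0("Level", 1:6))
h2 <- make_tax_header(c(paste0("Level", 1:5), NA))  # not assigned at level 6

# Write a two-record training fasta to a temporary file.
fasta_lines <- c(h1, "ACCTAGAAAGTCGTAGATCG", h2, "CGCTAGAAAGTCGTAGAAGG")
out_file <- tempfile(fileext = ".fa")
writeLines(fasta_lines, out_file)

# Read the second id line back and recover its levels.
levels2 <- parse_tax_header(readLines(out_file)[3])
```

A real training set would of course contain many more sequences; the point here is only the shape of the id lines.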
assignSpecies(...) expects the training data to be
provided in the form of a fasta file (or compressed fasta file), with
the id line formatted as follows:
>ID Genus species
ACCTAGAAAGTCGTAGATCGAAGTTGAAGCATCGCCCGATGATCGTCTGAAGCTGTAGCATGAGTCGATTTTCACATTCAGGGATACCATAGGATAC
>ID Genus species
CGCTAGAAAGTCGTAGAAGGCTCGGAGGTTTGAAGCATCGCCCGATGGGATCTCGTTGCTGTAGCATGAGTACGGACATTCAGGGATCATAGGATAC
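The species-level id line can be sanity-checked in a similar way before use. The sketch below assumes the three parts are separated by spaces and takes the first three fields as ID, genus and species; parse_species_header is a hypothetical helper, not a dada2 function, and it ignores any further fields.

```r
# Hypothetical helper (not part of dada2): split ">ID Genus species" into parts.
parse_species_header <- function(header) {
  fields <- strsplit(sub("^>", "", header), " +")[[1]]
  if (length(fields) < 3) {
    stop("expected 'ID Genus species', got: ", header)
  }
  # Only the first three space-separated fields are used in this sketch.
  c(id = fields[1], genus = fields[2], species = fields[3])
}

rec <- parse_species_header(">ID Genus species")

# An id line without a species field is rejected.
bad <- try(parse_species_header(">ID Genus"), silent = TRUE)
bad_rejected <- inherits(bad, "try-error")
```

Running such a check over every id line of a custom file makes malformed records visible before they reach assignSpecies.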