The primary goal of the dada2 package is to infer the exact sequence variants in amplicon sequenced samples and their associated abundances. However, a common companion step, especially when sequencing taxonomic markers such as 16S or 18S rRNA gene or the ITS region in fungi, is to assign taxonomy to the inferred sequence variants (or OTUs) by comparing them to a reference database of sequences with known taxonomy.

The dada2 package provides two taxonomic assignment methods.

assignTaxonomy(...) implements the RDP naive Bayesian classifier method described in Wang et al. 2007. In short, the kmer profile of the sequences to be classified are compared against the kmer profiles of all sequences in a training set of sequences with assigned taxonomies. The reference sequence with the most similar profile is used to assign taxonomy to the query sequence, and then a bootstrapping approach is used to assess the confidence assignment at each taxonomic level.

If assignment down to the species level is desired, assignSpecies(...) uses exact string matching against a reference database to assign Genus species binomials. In short, query sequence are compared against all reference sequences that had binomial genus-species nomenclature assigned, and the genus-species of all exact matches are recorded and returned if sufficiently unambiguous.

These two functions can be combined to produce taxonomic assignments from Kingdom down to Species with the addSpecies(...) convenience function. Consistent training fastas for this purpose are provided for the Silva and RDP databases.

Getting started

library(dada2); packageVersion("dada2")
## [1] '1.4.0'

Both assignTaxonomy and assignSpecies depend on training datasets of taxonomically assigned sequences. Appropriately formatted datasets for several popular reference databases are available.

In order to follow along with the examples below, download the rdp_train_set_16.fa.gz and rdp_species_assignment_16.fa.gz files to an appropriate location on your local machine.

 

Taxonomic assignment

Here we classify a handful of 16S sequences from the v4 region against the RDP training set. Our sequences of interest:

seqs <- c(
"ACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCCGGGCTCAACCTGGGAATTGCATTTCGAACTGGCGAACTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGA",
"ACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGCGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCGGTGTAGCAGTGAAATGCGTAGATATCGAGAGGA",
"ACGGAGGGTGCAAGCGTTAATCGGAATCACTGGGCGTAAAGCGCACGTAGGCTGTTATGTAAGTCAGGGGTGAAAGCCCACGGCTCAACCGTGGAACTGCCCTTGATACTGCACGACTCGAATCCGGGAGAGGGTGGCGGAATTCCAGGTGTAGGAGTGAAATCCGTAGATATCTGGAGGA",
"ACGTAGGTCCCGAACGTTGCGCGAATTTACTGGGCGTAAAGGGTCCGTAGGCGGTTTAGCAAGTGGTTGGTGAAATTTCACGGCTCAACCGTGAAACTGCCTTCCAAACTGCTAAACTTGAGGCAGGGAGAGGTCGGCGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGA",
"ACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTGGAAAGTTGGGGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTCCAAAACTATCAGTCTAGAGTTCGAGAGAGGTGAGTGGAATTCCAAGTGTAGAGGTGAAATTCGTAGATATTTGGAGGA",
"ACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACAGTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGCTGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGA",
"ACGTAGGGCGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGCTTGTCGCGTCGACTGTGAAAACCCGTGGCTCAACTGCGGGCTTGCAGTCGATACGGGCAGGCTAGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGA",
"ACGGAGGGTGCAAGCGTTAATCGGAATGACTGGGCGTAAAGCGCACGCAGGCGGTCTGTTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTCAAAACTGACAGGCTAGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGA",
"ACCGACGGCCCGAGTGGTGGCCACTCTTATTGGGCCTAAAGCGTCCGTAGCCGGTCCAGTAAGTCCCTGTTTAAATCCTGCGGCTTAACCGCAGGACTGGCAGGGATACTGCTGGACTTGGGACCGGGAGAGGACAAGGGTACTTCAGGGGTAGCGGTGAAATGTGTTGATCCTTGAAGGA",
"ACGTAGGTCCCGAACGTTGCGCGAAATTACTGGGCGTAAAGGGTCCGTAGGCGGTCTGGTAAGTGGAAGGTGAAAGCCCACGGCTCAACCGTGGAATTGCCTTCCAAACTGCTGGACTTGAGGGCGGAAGAGGTCGGCGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGA")

We run the assignTaxonomy(...) function with the RDP as the reference dataset. To run the following on your local machine, modify the filename in the second argument to point to where the rdp_train_set_16.fa.gz file sits on your computer.

set.seed(100) # Initialize random number generator for reproducibility
taxa <- assignTaxonomy(seqs, "Training/rdp_train_set_16.fa.gz", multithread=FALSE)
unname(taxa)
##       [,1]       [,2]              [,3]                 
##  [1,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [2,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae"   
##  [3,] "Bacteria" "Proteobacteria"  "Deltaproteobacteria"
##  [4,] "Bacteria" "Aquificae"       "Aquificae"          
##  [5,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
##  [6,] "Bacteria" "Bacteroidetes"   "Bacteroidia"        
##  [7,] "Bacteria" "Actinobacteria"  "Actinobacteria"     
##  [8,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [9,] "Archaea"  "Euryarchaeota"   "Methanococci"       
## [10,] "Bacteria" "Aquificae"       "Aquificae"          
##       [,4]                 [,5]                  [,6]                  
##  [1,] "Alteromonadales"    "Shewanellaceae"      "Shewanella"          
##  [2,] "Verrucomicrobiales" "Verrucomicrobiaceae" "Akkermansia"         
##  [3,] "Desulfovibrionales" "Desulfovibrionaceae" "Desulfovibrio"       
##  [4,] "Aquificales"        "Hydrogenothermaceae" "Sulfurihydrogenibium"
##  [5,] "Rhodobacterales"    "Rhodobacteraceae"    "Loktanella"          
##  [6,] "Bacteroidales"      "Bacteroidaceae"      "Bacteroides"         
##  [7,] "Actinomycetales"    "Micromonosporaceae"  "Salinispora"         
##  [8,] "Enterobacteriales"  "Enterobacteriaceae"  "Dickeya"             
##  [9,] "Methanococcales"    "Methanococcaceae"    "Methanococcus"       
## [10,] "Aquificales"        "Hydrogenothermaceae" "Persephonella"

The return value of assignTaxonomy(...) is a character matrix, with each row corresponding to an input sequence, and each column corresponding to a taxonomic level. For the RDP training set, the levels are Kingdom, Phylum, Class, Order, Family, Genus, and we see that each sequence has been classified all the way to genus.

An important parameter to consider when running assignTaxonomy(...) is minBoot, which sets the minimum bootstrapping support required to return a taxonomic classification. The original paper recommended a threshold of 50 for sequences of 250nts or less (as these are) but a threshold of 80 generally. Let’s see what happens when we classify at that higher threshold:

set.seed(100) # Initialize random number generator for reproducibility
taxa.80 <- assignTaxonomy(seqs, "Training/rdp_train_set_16.fa.gz", minBoot=80)
unname(taxa.80)
##       [,1]       [,2]              [,3]                 
##  [1,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [2,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae"   
##  [3,] "Bacteria" "Proteobacteria"  "Deltaproteobacteria"
##  [4,] "Bacteria" "Aquificae"       "Aquificae"          
##  [5,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
##  [6,] "Bacteria" "Bacteroidetes"   "Bacteroidia"        
##  [7,] "Bacteria" "Actinobacteria"  "Actinobacteria"     
##  [8,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [9,] "Archaea"  "Euryarchaeota"   "Methanococci"       
## [10,] "Bacteria" "Aquificae"       "Aquificae"          
##       [,4]                 [,5]                  [,6]                  
##  [1,] "Alteromonadales"    "Shewanellaceae"      "Shewanella"          
##  [2,] "Verrucomicrobiales" "Verrucomicrobiaceae" "Akkermansia"         
##  [3,] "Desulfovibrionales" "Desulfovibrionaceae" "Desulfovibrio"       
##  [4,] "Aquificales"        "Hydrogenothermaceae" "Sulfurihydrogenibium"
##  [5,] "Rhodobacterales"    "Rhodobacteraceae"    NA                    
##  [6,] "Bacteroidales"      "Bacteroidaceae"      "Bacteroides"         
##  [7,] "Actinomycetales"    "Micromonosporaceae"  NA                    
##  [8,] "Enterobacteriales"  "Enterobacteriaceae"  "Dickeya"             
##  [9,] "Methanococcales"    "Methanococcaceae"    "Methanococcus"       
## [10,] "Aquificales"        "Hydrogenothermaceae" "Persephonella"

At this more stringent threshold, sequence 5 and 7 are not classified at the genus level. Instead an NA is returned, indicating that less than minBoot=80 of the 100 bootstraps returned the same genus for those sequences.

Species assignment

We now run the assignSpecies function against the RDP species-level training set. To run the following on your local machine, modify the filename in the second argument to point to where the rdp_species_assignment_14.fa.gz file sits on your computer.

As of version 1.5.2, assignSpecies has been reimplemented to be much faster – more than 100x faster for common use cases.

genus.species <- assignSpecies(seqs, "Training/rdp_species_assignment_16.fa.gz")
unname(genus.species)
##       [,1]                   [,2]            
##  [1,] "Shewanella"           NA              
##  [2,] "Akkermansia"          "muciniphila"   
##  [3,] "Desulfovibrio"        "piger"         
##  [4,] "Sulfurihydrogenibium" "yellowstonense"
##  [5,] "Sulfitobacter"        "pontiacus"     
##  [6,] "Bacteroides"          NA              
##  [7,] "Salinispora"          NA              
##  [8,] NA                     NA              
##  [9,] NA                     NA              
## [10,] "Persephonella"        "marina"

By default the assignSpecies method only returns species assignments if there is no ambiguity, i.e. all exact matches were to the same species. However, given that we are generally working with fragments of the 16S gene, it is common that exact matches are made to multiple sequences that are identical over the sequenced region. This is often still useful information, so to have all sequence hits returned the returnMultiple=TRUE argument can be passed to the assignSpecies function:

unname(assignSpecies(seqs, "Training/rdp_species_assignment_16.fa.gz", allowMultiple=TRUE))
##       [,1]                  
##  [1,] "Shewanella"          
##  [2,] "Akkermansia"         
##  [3,] "Desulfovibrio"       
##  [4,] "Sulfurihydrogenibium"
##  [5,] "Sulfitobacter"       
##  [6,] "Bacteroides"         
##  [7,] "Salinispora"         
##  [8,] NA                    
##  [9,] NA                    
## [10,] "Persephonella"       
##       [,2]                                                                                   
##  [1,] "arctica/baltica/hafniensis/massilia/putrefaciens/vesiculosa"                          
##  [2,] "muciniphila"                                                                          
##  [3,] "piger"                                                                                
##  [4,] "yellowstonense"                                                                       
##  [5,] "pontiacus"                                                                            
##  [6,] "faecichinchillae/faecis/thetaiotaomicron"                                             
##  [7,] "arenicola/pacifica/tropica"                                                           
##  [8,] "aquatica/chrysanthemi/dadantii/dianthicola/nigrifluens/paradisiaca/roseae/solani/zeae"
##  [9,] NA                                                                                     
## [10,] "marina"

Several of these sequences exactly match multiple species. Setting allowMultiple to an integer value allows fine-grained control over this behavior. For example, allowMultiple=3 returns up to 3 different matched species, but if 4 or more are matched it returns NA.

Kingom to Species

In many cases it is desirable to link the species annotation in with the broader taxonomic annotation afforded by assignTaxonomy. The convenience function addSpecies serves this purpose, as it takes as input a taxonomy table, and outputs a table with an added species column. Only those genus-species binomials which are consistent with the genus assigned in the provided taxonomy table are retained in the output. Note that we have switched to the silva training fastas here for demonstration purposes.

tt <- assignTaxonomy(seqs, "Training/silva_nr_v128_train_set.fa.gz")
tt.plus <- addSpecies(tt, "Training/silva_species_assignment_v128.fa.gz", verbose=TRUE)
## 5 out of 10 were assigned to the species level.
## Of which 4 had genera consistent with the input table.
unname(tt.plus)
##       [,1]       [,2]              [,3]                 
##  [1,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [2,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae"   
##  [3,] "Bacteria" "Proteobacteria"  "Deltaproteobacteria"
##  [4,] "Bacteria" "Aquificae"       "Aquificae"          
##  [5,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
##  [6,] "Bacteria" "Bacteroidetes"   "Bacteroidia"        
##  [7,] "Bacteria" "Actinobacteria"  "Actinobacteria"     
##  [8,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [9,] "Archaea"  "Euryarchaeota"   "Methanococci"       
## [10,] "Bacteria" "Aquificae"       "Aquificae"          
##       [,4]                 [,5]                  [,6]                  
##  [1,] "Alteromonadales"    "Shewanellaceae"      "Shewanella"          
##  [2,] "Verrucomicrobiales" "Verrucomicrobiaceae" "Akkermansia"         
##  [3,] "Desulfovibrionales" "Desulfovibrionaceae" "Desulfovibrio"       
##  [4,] "Aquificales"        "Hydrogenothermaceae" "Sulfurihydrogenibium"
##  [5,] "Rhodobacterales"    "Rhodobacteraceae"    NA                    
##  [6,] "Bacteroidales"      "Bacteroidaceae"      "Bacteroides"         
##  [7,] "Micromonosporales"  "Micromonosporaceae"  "Salinispora"         
##  [8,] "Enterobacteriales"  "Enterobacteriaceae"  "Dickeya"             
##  [9,] "Methanococcales"    "Methanococcaceae"    "Methanococcus"       
## [10,] "Aquificales"        "Hydrogenothermaceae" "Persephonella"       
##       [,7]            
##  [1,] NA              
##  [2,] "muciniphila"   
##  [3,] "piger"         
##  [4,] "yellowstonense"
##  [5,] NA              
##  [6,] NA              
##  [7,] NA              
##  [8,] NA              
##  [9,] NA              
## [10,] "marina"

For human microbiome data and V4 sequencing, we typically see a bit under half of the abundant ribosomal sequence variants (RSVs) assigned unambiguously to species-level, and a bit under 2/3rds assigned ambiguously (ie. allowing multiple species assignment). This fraction drops off in less sampled environments and for rare RSVs.


Maintained by Benjamin Callahan (benjamin DOT j DOT callahan AT gmail DOT com)