Taxonomic Assignment

The dada2 package provides two methods to assign taxonomy to phylogenetically informative marker-gene data, such as the 16S or 18S rRNA gene and the ITS region in fungi. The first and general-purpose method, assignTaxonomy, uses the naive Bayesian classifier method of Wang et al. 2007 to assign taxonomy across multiple ranks (e.g. Kingdom to Genus). The second method, assignSpecies, is intended for the more difficult task of taxonomic assignments at or near the resolution limit of the sequenced marker-gene, and was developed specifically for the task of making species-level assignments to 16S data.

Both assignTaxonomy and assignSpecies make taxonomic assignments by comparing to a set of taxonomically assigned sequences in a provided reference fasta file. Appropriately formatted reference fasta files for several popular reference databases are available, including Silva, GreenGenes, RDP and UNITE.

Getting started

library(dada2); packageVersion("dada2")

## [1] '1.11.4'

In order to follow along with the examples below, download the rdp_train_set_16.fa.gz and rdp_species_assignment_16.fa.gz files to an appropriate location on your local machine.

Taxonomic assignment

assignTaxonomy(...) implements the RDP naive Bayesian classifier method described in Wang et al. 2007. In short, the kmer profile of the sequences to be classified are compared against the kmer profiles of all sequences in a training set of sequences with assigned taxonomies. The reference sequence with the most similar profile is used to assign taxonomy to the query sequence, and then a bootstrapping approach is used to assess the confidence assignment at each taxonomic level.

Here we classify a handful of 16S sequences from the v4 region against the RDP training set. Our sequences of interest:

seqs <- c(
"ACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCCGGGCTCAACCTGGGAATTGCATTTCGAACTGGCGAACTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGA",
"ACAGAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGCGCGGGCTCAACCCGCGGACGGCACATGATACTGCGAGACTAGAGTAATGGAGGGGGAACCGGAATTCTCGGTGTAGCAGTGAAATGCGTAGATATCGAGAGGA",
"ACGGAGGGTGCAAGCGTTAATCGGAATCACTGGGCGTAAAGCGCACGTAGGCTGTTATGTAAGTCAGGGGTGAAAGCCCACGGCTCAACCGTGGAACTGCCCTTGATACTGCACGACTCGAATCCGGGAGAGGGTGGCGGAATTCCAGGTGTAGGAGTGAAATCCGTAGATATCTGGAGGA",
"ACGTAGGTCCCGAACGTTGCGCGAATTTACTGGGCGTAAAGGGTCCGTAGGCGGTTTAGCAAGTGGTTGGTGAAATTTCACGGCTCAACCGTGAAACTGCCTTCCAAACTGCTAAACTTGAGGCAGGGAGAGGTCGGCGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGA",
"ACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTGGAAAGTTGGGGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTCCAAAACTATCAGTCTAGAGTTCGAGAGAGGTGAGTGGAATTCCAAGTGTAGAGGTGAAATTCGTAGATATTTGGAGGA",
"ACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACAGTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGCTGTCTTGAGTACAGTAGAGGTGGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGA",
"ACGTAGGGCGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGCTTGTCGCGTCGACTGTGAAAACCCGTGGCTCAACTGCGGGCTTGCAGTCGATACGGGCAGGCTAGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGA",
"ACGGAGGGTGCAAGCGTTAATCGGAATGACTGGGCGTAAAGCGCACGCAGGCGGTCTGTTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTCAAAACTGACAGGCTAGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGA",
"ACCGACGGCCCGAGTGGTGGCCACTCTTATTGGGCCTAAAGCGTCCGTAGCCGGTCCAGTAAGTCCCTGTTTAAATCCTGCGGCTTAACCGCAGGACTGGCAGGGATACTGCTGGACTTGGGACCGGGAGAGGACAAGGGTACTTCAGGGGTAGCGGTGAAATGTGTTGATCCTTGAAGGA",
"ACGTAGGTCCCGAACGTTGCGCGAAATTACTGGGCGTAAAGGGTCCGTAGGCGGTCTGGTAAGTGGAAGGTGAAAGCCCACGGCTCAACCGTGGAATTGCCTTCCAAACTGCTGGACTTGAGGGCGGAAGAGGTCGGCGGAATTCCCGGTGTAGCGGTGAAATGCGTAGATATCGGGAGGA")

We run the assignTaxonomy(...) function with the RDP as the reference dataset. To run the following on your local machine, modify the filename in the second argument to point to where the rdp_train_set_16.fa.gz file sits on your computer.

set.seed(100) # Initialize random number generator for reproducibility
taxa <- assignTaxonomy(seqs, "Training/rdp_train_set_16.fa.gz", multithread=FALSE)
unname(taxa)

##       [,1]       [,2]              [,3]                 
##  [1,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [2,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae"   
##  [3,] "Bacteria" "Proteobacteria"  "Deltaproteobacteria"
##  [4,] "Bacteria" "Aquificae"       "Aquificae"          
##  [5,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
##  [6,] "Bacteria" "Bacteroidetes"   "Bacteroidia"        
##  [7,] "Bacteria" "Actinobacteria"  "Actinobacteria"     
##  [8,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [9,] "Archaea"  "Euryarchaeota"   "Methanococci"       
## [10,] "Bacteria" "Aquificae"       "Aquificae"          
##       [,4]                 [,5]                  [,6]                  
##  [1,] "Alteromonadales"    "Shewanellaceae"      "Shewanella"          
##  [2,] "Verrucomicrobiales" "Verrucomicrobiaceae" "Akkermansia"         
##  [3,] "Desulfovibrionales" "Desulfovibrionaceae" "Desulfovibrio"       
##  [4,] "Aquificales"        "Hydrogenothermaceae" "Sulfurihydrogenibium"
##  [5,] "Rhodobacterales"    "Rhodobacteraceae"    "Loktanella"          
##  [6,] "Bacteroidales"      "Bacteroidaceae"      "Bacteroides"         
##  [7,] "Actinomycetales"    "Micromonosporaceae"  "Salinispora"         
##  [8,] "Enterobacteriales"  "Enterobacteriaceae"  "Dickeya"             
##  [9,] "Methanococcales"    "Methanococcaceae"    "Methanococcus"       
## [10,] "Aquificales"        "Hydrogenothermaceae" "Persephonella"

The return value of assignTaxonomy(...) is a character matrix, with each row corresponding to an input sequence, and each column corresponding to a taxonomic level. For the RDP training set, the levels are Kingdom, Phylum, Class, Order, Family, Genus, and we see that each sequence has been classified all the way to genus.

An important parameter to consider when running assignTaxonomy(...) is minBoot, which sets the minimum bootstrapping support required to return a taxonomic classification. The original paper recommended a threshold of 50 for sequences of 250nts or less (as these are) but a threshold of 80 generally. Let’s see what happens when we classify at that higher threshold:

set.seed(100) # Initialize random number generator for reproducibility
taxa.80 <- assignTaxonomy(seqs, "Training/rdp_train_set_16.fa.gz", minBoot=80)
unname(taxa.80)

##       [,1]       [,2]              [,3]                 
##  [1,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [2,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae"   
##  [3,] "Bacteria" "Proteobacteria"  "Deltaproteobacteria"
##  [4,] "Bacteria" "Aquificae"       "Aquificae"          
##  [5,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
##  [6,] "Bacteria" "Bacteroidetes"   "Bacteroidia"        
##  [7,] "Bacteria" "Actinobacteria"  "Actinobacteria"     
##  [8,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [9,] "Archaea"  "Euryarchaeota"   "Methanococci"       
## [10,] "Bacteria" "Aquificae"       "Aquificae"          
##       [,4]                 [,5]                  [,6]                  
##  [1,] "Alteromonadales"    "Shewanellaceae"      "Shewanella"          
##  [2,] "Verrucomicrobiales" "Verrucomicrobiaceae" "Akkermansia"         
##  [3,] "Desulfovibrionales" "Desulfovibrionaceae" "Desulfovibrio"       
##  [4,] "Aquificales"        "Hydrogenothermaceae" "Sulfurihydrogenibium"
##  [5,] "Rhodobacterales"    "Rhodobacteraceae"    NA                    
##  [6,] "Bacteroidales"      "Bacteroidaceae"      "Bacteroides"         
##  [7,] "Actinomycetales"    "Micromonosporaceae"  "Salinispora"         
##  [8,] "Enterobacteriales"  "Enterobacteriaceae"  "Dickeya"             
##  [9,] "Methanococcales"    "Methanococcaceae"    "Methanococcus"       
## [10,] "Aquificales"        "Hydrogenothermaceae" "Persephonella"

At this more stringent threshold, sequence 5 and 7 are not classified at the genus level. Instead an NA is returned, indicating that less than minBoot=80 of the 100 bootstraps returned the same genus for those sequences.

Species assignment

Fast and appropriate species-level assignment from 16S data is provided by the assignSpecies method. assignSpecies uses exact string matching against a reference database to assign Genus species binomials. In short, query sequence are compared against all reference sequences that had binomial genus-species nomenclature assigned, and the genus-species of all exact matches are recorded and returned if it is unambiguous. Recent results indicate that exact matching (or 100% identity) with amplicon sequence variants (ASVs) is the only appropriate method for species-level assignment to high-throughput 16S amplicon data.

We now run the assignSpecies function against the RDP species-level training set. To run the following on your local machine, modify the filename in the second argument to point to where the rdp_species_assignment_14.fa.gz file sits on your computer.

As of version 1.5.2, assignSpecies has been reimplemented to be much faster, and it now easily scales to handle even very large sets of sequences.

genus.species <- assignSpecies(seqs, "Training/rdp_species_assignment_16.fa.gz")
unname(genus.species)

##       [,1]                   [,2]            
##  [1,] "Shewanella"           NA              
##  [2,] "Akkermansia"          "muciniphila"   
##  [3,] "Desulfovibrio"        "piger"         
##  [4,] "Sulfurihydrogenibium" "yellowstonense"
##  [5,] "Sulfitobacter"        "pontiacus"     
##  [6,] "Bacteroides"          NA              
##  [7,] "Salinispora"          NA              
##  [8,] NA                     NA              
##  [9,] NA                     NA              
## [10,] "Persephonella"        "marina"

By default the assignSpecies method only returns species assignments if there is no ambiguity, i.e. all exact matches were to the same species. However, given that we are generally working with fragments of the 16S gene, it is common that exact matches are made to multiple sequences that are identical over the sequenced region. This is often still useful information, so to have all sequence hits returned the returnMultiple=TRUE argument can be passed to the assignSpecies function:

unname(assignSpecies(seqs, "Training/rdp_species_assignment_16.fa.gz", allowMultiple=TRUE))

##       [,1]                  
##  [1,] "Shewanella"          
##  [2,] "Akkermansia"         
##  [3,] "Desulfovibrio"       
##  [4,] "Sulfurihydrogenibium"
##  [5,] "Sulfitobacter"       
##  [6,] "Bacteroides"         
##  [7,] "Salinispora"         
##  [8,] NA                    
##  [9,] NA                    
## [10,] "Persephonella"       
##       [,2]                                                                                   
##  [1,] "arctica/baltica/hafniensis/massilia/putrefaciens/vesiculosa"                          
##  [2,] "muciniphila"                                                                          
##  [3,] "piger"                                                                                
##  [4,] "yellowstonense"                                                                       
##  [5,] "pontiacus"                                                                            
##  [6,] "faecichinchillae/faecis/thetaiotaomicron"                                             
##  [7,] "arenicola/pacifica/tropica"                                                           
##  [8,] "aquatica/chrysanthemi/dadantii/dianthicola/nigrifluens/paradisiaca/roseae/solani/zeae"
##  [9,] NA                                                                                     
## [10,] "marina"

Several of these sequences exactly match multiple species. Setting allowMultiple to an integer value allows fine-grained control over this behavior. For example, allowMultiple=3 returns up to 3 different matched species, but if 4 or more are matched it returns NA.

If using this workflow on your own data: In many environments, few sequences will be assigned to species level. That is OK! Reference databases are incomplete, and species assignment is at the limit of what is possible from 16S amplicon data. Remember, even if the sequenced organism is in the reference database, if another species shares the same 16S gene sequence it is impossible to unambiguously assign that sequence to species-level.

Kingom to Species

In many cases it is desirable to link the species annotation in with the broader taxonomic annotation afforded by assignTaxonomy. The convenience function addSpecies serves this purpose, as it takes as input a taxonomy table, and outputs a table with an added species column. Only those genus-species binomials which are consistent with the genus assigned in the provided taxonomy table are retained in the output. Note that we have switched to the silva training fastas here for demonstration purposes.

tt <- assignTaxonomy(seqs, "Training/silva_nr_v132_train_set.fa.gz")
tt.plus <- addSpecies(tt, "Training/silva_species_assignment_v132.fa.gz", verbose=TRUE)

## 5 out of 10 were assigned to the species level.
## Of which 4 had genera consistent with the input table.

unname(tt.plus)

##       [,1]       [,2]              [,3]                 
##  [1,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [2,] "Bacteria" "Verrucomicrobia" "Verrucomicrobiae"   
##  [3,] "Bacteria" "Proteobacteria"  "Deltaproteobacteria"
##  [4,] "Bacteria" "Aquificae"       "Aquificae"          
##  [5,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
##  [6,] "Bacteria" "Bacteroidetes"   "Bacteroidia"        
##  [7,] "Bacteria" "Actinobacteria"  "Actinobacteria"     
##  [8,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
##  [9,] "Archaea"  "Euryarchaeota"   "Methanococci"       
## [10,] "Bacteria" "Aquificae"       "Aquificae"          
##       [,4]                 [,5]                  [,6]                  
##  [1,] "Alteromonadales"    "Shewanellaceae"      "Shewanella"          
##  [2,] "Verrucomicrobiales" "Akkermansiaceae"     "Akkermansia"         
##  [3,] "Desulfovibrionales" "Desulfovibrionaceae" "Desulfovibrio"       
##  [4,] "Hydrogenothermales" "Hydrogenothermaceae" "Sulfurihydrogenibium"
##  [5,] "Rhodobacterales"    "Rhodobacteraceae"    "Loktanella"          
##  [6,] "Bacteroidales"      "Bacteroidaceae"      "Bacteroides"         
##  [7,] "Micromonosporales"  "Micromonosporaceae"  "Salinispora"         
##  [8,] "Enterobacteriales"  "Enterobacteriaceae"  "Dickeya"             
##  [9,] "Methanococcales"    "Methanococcaceae"    "Methanococcus"       
## [10,] "Hydrogenothermales" "Hydrogenothermaceae" "Persephonella"       
##       [,7]            
##  [1,] NA              
##  [2,] "muciniphila"   
##  [3,] "piger"         
##  [4,] "yellowstonense"
##  [5,] NA              
##  [6,] NA              
##  [7,] NA              
##  [8,] NA              
##  [9,] NA              
## [10,] "marina"

For human microbiome data and V4 sequencing, we typically see a bit under half of the abundant 16S amplicon sequence variants (ASVs) assigned unambiguously to species-level, and a bit under 2/3rds assigned ambiguously (ie. allowing multiple species assignment). This fraction drops off in less sampled environments and for rare ASVs.

Maintained by Benjamin Callahan (benjamin DOT j DOT callahan AT gmail DOT com)
Documentation License: CC-BY 4.0