This past week a paper came out from a combination of the QIIME developers and the developers of some recent OTU clustering methods: Open-Source Sequence Clustering Methods Improve the State Of the Art. In this paper, several new and open source methods (Swarm, SUMACLUST, OTUCLUST) are compared to the standard methods in the field (QIIME/uclust, mothur, UPARSE).
As the developer of an open-source method for inferring sample composition from amplicon sequencing data, I saw this as an ideal opportunity to compare the accuracy of our method, DADA2, to the “state of the art”.
To briefly recap, DADA2 is a method to infer the sample sequences in a collection of amplicon sequencing reads. DADA2 is a de novo method and is completely reference-free. DADA2 does not create OTUs; it infers sample sequences exactly, resolving differences of as little as one nucleotide. DADA2 is an open-source method implemented as an R package available here.
This document is the compiled version of an R markdown document (available here) and all the code here can be replicated on your own computer.
Load packages and set working directory:
library(ShortRead); packageVersion("ShortRead")
## [1] '1.26.0'
library(dada2); packageVersion("dada2")
## [1] '0.10.3'
setwd("~/Desktop/OpenOTU/")
To run this R code yourself, install the DADA2 R package, download the supplementary data from the paper, unzip it, and set your R session’s working directory to the folder in which it was unzipped.
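At the time of writing, one route to installing the package is from its GitHub repository (a sketch; if the instructions at the link above differ, follow those instead):
# install.packages("devtools") # if devtools is not already installed
devtools::install_github("benjjneb/dada2")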
The first datasets considered in the paper are simulated datasets from a known mixture of template sequences. Errors are added using a software program that attempts to mimic the errors introduced by PCR amplification and Illumina sequencing. The GreenGenes code for the correct bacterial strain is included in the id line of the fastq file, thus we know the correct mapping for every read in these datasets.
The first such dataset consists of an even distribution (100 reads each) of 1,076 randomly selected GreenGenes 16S V4 gene segments.
This is basically the easiest possible problem for these methods: There are no chimeras, errors are simulated and won’t include pathological cases, and the distribution of abundances is completely even. The one difficulty is that these strains are separated from each other by as little as one nucleotide in some cases.
When generating this dataset, the authors left in several sequence templates from GreenGenes that contain ambiguous nucleotides (e.g. K, Y, ...) in the V4 region. This results in sequencing reads containing those ambiguous bases in the fastq files, something that would never actually occur in Illumina sequencing. So we’re going to start by removing those records:
getGGIDs <- function(fn) { # Extracts the GG IDs from the provided fastq file
  srq <- readFastq(fn)
  ids <- sapply(strsplit(as.character(id(srq)), "\\s"), `[`, 2)
  ids <- sapply(strsplit(ids, "-"), `[`, 1)
  ids
}
fn.even <- "supplemental_otu_clustering_datasets/UPARSE_not_filtered_QIIME_label_format/16S/even/seqs.fastq"
ids <- getGGIDs(fn.even)
srq <- readFastq(fn.even)
seqs <- as.character(sread(srq))
no.ambig <- sapply(seqs, dada2:::C_check_ACGT)
table(ids[!no.ambig])
##
## 111613 1141509 146992 158725 186887 3839317 512840 580633 7628
## 100 100 100 100 99 100 99 99 100
There are 9 variants with ambiguous bases in the V4 region being used as templates. We remove them:
ambig.ids <- names(table(ids[!no.ambig]))
fn.even.noN <- "even.noN.fastq.gz"
if(file.exists(fn.even.noN)) file.remove(fn.even.noN)
## [1] TRUE
writeFastq(srq[!ids %in% ambig.ids], fn.even.noN, compress=TRUE, width=20000L)
There are now 1,067 GreenGenes strains in the dataset, each with 100 reads.
This is simulated data, but it’s always good practice to look at the quality profile of the data before jumping in:
qq <- qa(fn.even.noN)[["perCycle"]]$quality
ShortRead:::.plotCycleQuality(qq, main="Sim - Even")
It looks very simulated. Typically we recommend trimming the first 10 bases and perhaps some at the end, as well as quality filtering based on expected errors. But in the interests of getting nice round numbers at the end, we’re going filter-free!
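For reference, our usual recommendation would look something like the sketch below (not run here; the trimming parameters and output filename are illustrative, using the fastqFilter function applied later in this document):
# Not run: trim the first 10 bases and filter on expected errors and Ns.
# fastqFilter(fn.even.noN, "even.filt.fastq.gz", trimLeft=10, truncQ=2, maxEE=2, maxN=0, verbose=TRUE)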
Entering the DADA2 pipeline (minus filtering and chimera-checking):
derep.even <- derepFastq(fn.even.noN, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: even.noN.fastq.gz
## Encountered 29105 unique sequences from 106700 total sequences read.
dada.even <- dada(derep.even, err=inflateErr(tperr1, 3), selfConsist=TRUE)
## Sample 1 - 106700 reads in 29105 unique sequences.
## selfConsist step 2
## selfConsist step 3
##
##
## Convergence after 3 rounds.
dada.even
## dada-class: object describing DADA2 denoising results
## 1055 sample sequences were inferred from 29105 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, BAND_SIZE = 16, USE_QUALS = TRUE
The DADA2 pipeline inferred 1,055 sample sequences. This is more than any of the other algorithms detected (see Table 2 in the paper), but it’s less than the 1,067 strains that made up this data.
Let’s take a look at the abundances of the inferred sample sequences.
table(dada.even$clustering$abundance)
##
## 98 99 100 101 102 200 301
## 1 3 1037 2 1 10 1
First off, we see that we are getting the abundance estimates essentially exactly right: 1,037/1,055 inferred sample sequences have an inferred abundance of 100, exactly what is expected. This table also tells us why we aren’t getting 1,067 variants: several strains have identical 16S sequences over this region. There are 10 pairs of identical strains (the abundance-200 sequences) and 1 triple of identical strains (the ~300-abundance sequence). After subtracting out the 12 strains with duplicated sequences, we have 1,055 unique sequence variants in the dataset, all of which were detected by DADA2.
We can dig deeper by using the GreenGenes ID in the id line of the fastq file to check whether the reads of each strain are all being assigned to the same output sequence.
assigned.even <- dada.even$map[derep.even$map]
ggids.even <- getGGIDs(fn.even.noN)
joint.table <- table(ggids.even, assigned.even)
joint.mat <- as.matrix(joint.table)
# Rows are GreenGenes IDs from the fastq id line
# Columns are the number of the output sequence to which the read was assigned
# Entries are the number of reads with that ggID and assignment
table(rowSums(joint.mat>0))
##
## 1 2
## 1063 4
The assignment is almost perfect: in 1,063/1,067 cases, all reads with a given ID are assigned to the same output sequence by DADA2. We look more closely at the 4 remaining cases:
multi.assigned <- rowSums(joint.mat>0) > 1
joint.mat[multi.assigned, colSums(joint.mat[multi.assigned,]>0)>0]
## assigned.even
## ggids.even 1 115 251 281 1048 1052 1053 1055
## 178767 0 2 0 0 0 0 98 0
## 185411 1 0 0 0 0 0 0 99
## 4461030 0 0 0 99 1 0 0 0
## 554668 0 0 1 0 0 99 0 0
Even these assignments are nearly perfect. In the first case 2 reads of GG strain 178767 (out of 100) are assigned to a different output sequence than the rest. In the other 3 cases just 1 read (out of 100) is differently assigned.
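As a final sanity check on the even dataset (a small sketch using the joint table built above), we can count how many strains contribute the bulk of the reads to each output sequence; the duplicated templates should show up as output sequences fed by 2 or 3 strains:
# For each output sequence, count the GreenGenes IDs contributing >80 reads to it.
# We expect mostly 1, with the duplicate pairs giving 2 and the triple giving 3.
table(colSums(joint.mat > 80))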
This second dataset is simulated from the same set of GreenGenes 16S sequence templates, but now the strain abundances vary from 1 to 197 reads.
We start by again removing the problematic sequence variants with ambiguous nucleotides:
fn.stag <- "supplemental_otu_clustering_datasets/UPARSE_not_filtered_QIIME_label_format/16S/staggered/seqs.fastq"
ids <- getGGIDs(fn.stag)
srq <- readFastq(fn.stag)
fn.stag.noN <- "stag.noN.fastq.gz"
if(file.exists(fn.stag.noN)) file.remove(fn.stag.noN)
## [1] TRUE
writeFastq(srq[!ids %in% ambig.ids], fn.stag.noN, compress=TRUE, width=20000L)
As before, there are now 1,067 GreenGenes strains in the dataset, representing 1,055 unique sequence variants.
Entering the DADA2 pipeline (minus filtering and chimera-checking):
derep.stag <- derepFastq(fn.stag.noN, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: stag.noN.fastq.gz
## Encountered 28651 unique sequences from 106006 total sequences read.
dada.stag <- dada(derep.stag, err=inflateErr(tperr1, 3), selfConsist=TRUE)
## Sample 1 - 106006 reads in 28651 unique sequences.
## selfConsist step 2
## selfConsist step 3
##
##
## Convergence after 3 rounds.
dada.stag
## dada-class: object describing DADA2 denoising results
## 1042 sample sequences were inferred from 28651 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, BAND_SIZE = 16, USE_QUALS = TRUE
DADA2 infers only 1,042 output sequences this time (of the 1,055 unique variants in the dataset). This is still more than any of the methods tested in the paper, in part because DADA2’s high resolution allows it to differentiate very similar sequence variants.
Let’s take a closer look at what was missed by using the fact that we know all the sequence variants from analyzing the even dataset. First we extract the GreenGenes IDs of the strains DADA2 missed in the staggered dataset:
all(dada.stag$clustering$sequence %in% dada.even$clustering$sequence)
## [1] TRUE
in.even.not.staggered <- which(!dada.even$clustering$sequence %in% dada.stag$clustering$sequence)
strain.lookup <- joint.mat[,in.even.not.staggered]
strain.lookup <- strain.lookup[rowSums(strain.lookup)>0,]
# strain.lookup # 1 strain per output sequence for these missed variants
missing.strains <- rownames(strain.lookup)
missing.strains
## [1] "1577238" "178767" "218785" "258871" "279281" "4309513" "4322107"
## [8] "4361688" "552844" "553035" "554668" "749724" "811456"
With the GGIDs of the missing strains in hand, let’s look at their presence in the staggered dataset:
ids.stag <- getGGIDs(fn.stag.noN)
table(ids.stag)[missing.strains]
## ids.stag
## 1577238 178767 218785 258871 279281 4309513 4322107 4361688 552844
## 1 5 1 1 1 3 3 1 1
## 553035 554668 749724 811456
## 1 3 3 1
There are 13 strains missed in the staggered dataset. Eight of the missing strains contribute only one read to the dataset. DADA2 does not call singleton variants (and singleton OTUs were excluded in the paper) because of how hard it is to distinguish real biological singletons from errors. The remaining missed strains had just 3 or 5 reads.
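As a quick check of how much of the gap singletons account for (a small sketch using the ids.stag vector computed above):
# Number of GreenGenes strains represented by exactly one read in the staggered dataset.
sum(table(ids.stag) == 1)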
While we don’t get easily interpretable round abundances here, we can still evaluate the accuracy of DADA2’s inferred abundances overall:
# Map between staggered assignments and even assignments by matching sequences
even.assign.of.stag.assign <- match(dada.stag$sequence, dada.even$sequence)
# Get the GGids of the unique variants corresponding to each corresponding even assignment
# This uses our knowledge that even assignments of >2 reads were all correct
ggids.of.corresponding.unique.variants <- lapply(even.assign.of.stag.assign, function(col) rownames(joint.mat)[(joint.mat>80)[,col]])
# Calculate the abundances of each unique variant from the fastq ID field
unique.variant.abundances <- sapply(ggids.of.corresponding.unique.variants, function(ids) sum(table(ids.stag)[ids]))
# Visualize
plot(dada.stag$clustering$abundance, unique.variant.abundances, xlab="DADA2 inferred abundance", ylab="True abundance")
abline(0,1,col="red")
cor(dada.stag$clustering$abundance, unique.variant.abundances)
## [1] 0.999986
DADA2’s inferred abundances are extremely accurate.
The second set of test datasets used in the paper is a set of mock communities (i.e. mixtures of known strains) that were sequenced for the Bokulich et al. paper Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing.
This mock community dataset (also Bokulich_6 in the paper) is a single sample of a mixture of 48 strains. It is the smallest Bokulich dataset, at just 251k 150 nt reads sequenced on a MiSeq, but it is also the most diverse.
It is not clear whether the phiX contamination in the Bokulich datasets was removed in the benchmarking paper, but we do so here regardless. For the unfamiliar: when performing Illumina amplicon sequencing, phiX DNA is added in order to maintain enough heterogeneity in base composition for the base-calling software to work. While the phiX DNA is often removed by the sequencing center, it is not uncommon for phiX sequences to remain, and you should always check for phiX before processing new amplicon sequencing data.
There are various ways to remove phiX contamination. My solution is to BLAST locally against the phiX genome and then remove any read that hits it using a custom python script. My shell workflow is below. It requires a local installation of BLAST, the phiX genome made into a blastdb, and this simple python script. These commands should be executed in the supplemental_otu_clustering_datasets/UPARSE_not_filtered_QIIME_label_format/16S/1688/ directory:
####### TO BE EXECUTED BY HAND IN YOUR SHELL
### awk '{ if (NR % 4 == 1) print ">",((NR+3)/4); else if (NR % 4 == 2) print $0; }' seqs.fastq > seqs.fasta
### blastn -query seqs.fasta -db phix -out phix_hits.out -max_target_seqs 1 -outfmt "6 qacc evalue"
### python rm_hits.py phix_hits.out seqs.fastq 1688_seqs_nophix.fastq
This reduces the number of sequencing reads from 250,903 to 247,244. You can also download the phiX-free fastq file here.
There are general guidelines for quality filtering (for example, we generally recommend filtering out reads with more than 2 expected errors), but you should always look at your data and modify those guidelines as necessary.
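The expected errors of a read are just the sum of the per-base error probabilities implied by its quality scores, EE = sum(10^(-Q/10)); a small worked sketch of that calculation for a hypothetical read:
# Expected errors for a made-up vector of quality scores; the maxEE filter
# discards reads for which this sum exceeds the chosen threshold.
quals <- c(37, 37, 35, 32, 30, 25, 20, 15, 11, 2)
sum(10^(-quals/10))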
Visualizing the quality profile:
fn.1688 <- "supplemental_otu_clustering_datasets/UPARSE_not_filtered_QIIME_label_format/16S/1688/1688_seqs_nophix.fastq"
qq <- qa(fn.1688)[["perCycle"]]$quality
ShortRead:::.plotCycleQuality(qq, main="Bokulich 1688")
The quality really falls off a cliff over the last 10 nucleotides. We’ll trim those, and perform our standard recommended filters on expected errors and Ns:
filt.1688 <- "filt_1688.fastq.gz"
if(file.exists(filt.1688)) file.remove(filt.1688)
## [1] TRUE
fastqFilter(fn.1688, filt.1688, truncLen=140, truncQ=2, maxEE=2, maxN=0, verbose=TRUE)
## Read in 247244 , outputted 206613 filtered sequences.
Executing the DADA2 pipeline (with chimera-checking this time):
derep.1688 <- derepFastq(filt.1688, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: filt_1688.fastq.gz
## Encountered 25159 unique sequences from 206613 total sequences read.
dada.1688 <- dada(derep.1688, err=inflateErr(tperr1, 3), selfConsist=TRUE)
## The supplied error matrix does not extend to maximum observed Quality Scores in derep (41).
## Extending error rates by repeating the last column of the Error Matrix (column 41).
## In selfConsist mode this should converge to the proper error rates, otherwise this is probably not what you want.
## Sample 1 - 206613 reads in 25159 unique sequences.
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
##
##
## Convergence after 6 rounds.
bim.1688 <- isBimeraDenovo(dada.1688, verbose=TRUE)
## Identified 212 bimeras out of 276 input sequences.
clust.1688 <- dada.1688$clustering[!bim.1688,]
cat("DADA2 identified", nrow(clust.1688), "non-chimeric sequence variants.\n")
## DADA2 identified 64 non-chimeric sequence variants.
One of the bonuses of DADA2 is that in selfConsist mode the algorithm is jointly inferring the error rates in your data as a function of the quality score. Take a look:
plotErrors(dada.1688, nominalQ=TRUE)
The red line is the expected error rate if the quality scores were exactly right; the black line is DADA2’s fitted error model. Notice how the observed error rates flatten out at higher qualities. This is typical, although it has improved with more recent Illumina machines/software.
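The fitted rates themselves are stored in the returned dada object; a minimal peek (assuming the $err_out matrix layout used by this version of dada2, with one row per nucleotide transition and one column per quality score):
# Dimensions and a small corner of the fitted error-rate matrix.
dim(dada.1688$err_out)
round(dada.1688$err_out[1:4, 1:6], 5)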
This mock community is supposed to be a mixture of 48 bacterial strains (see Supplementary Table 6 in the original Bokulich paper), which is fairly close to the 64 sequence variants output by DADA2 (and a long way from the 1,000+ OTUs produced by Swarm, SUMACLUST and uclust). However, accuracy at the OTU level wasn’t measured in the paper, the focus instead being on accuracy after grouping into broader taxonomic categories.
We’ll follow the paper’s approach and do the same here. We’ll assign taxonomies using DADA2’s assignTaxonomy function, which implements the RDP classifier method, and the GreenGenes 13.8 database clustered at 97% (the DADA2 formatted training fasta file is available here):
set.seed(100)
tax.1688 <- assignTaxonomy(clust.1688$sequence, "ggtrain_97.fa", minBoot=50)
tax.1688 <- gsub("\\[", "", tax.1688)
tax.1688 <- gsub("\\]", "", tax.1688)
justGenus <- function(taxs) {
  rval <- character(0)
  for(i in seq_along(taxs)) {
    ranks <- strsplit(taxs[[i]], ";")[[1]]
    rval[[i]] <- paste(ranks[1:6], collapse=";")
  }
  rval <- gsub(";NA", "", rval)
  rval
}
genus.1688 <- justGenus(tax.1688)
From the original Bokulich paper (Supplementary Table 6), we know the identities of the 48 strains intended to be in this mock community. We can extract the rRNA sequences from the known genomes of those strains, classify the sequenced region with RDP against GreenGenes, and thereby generate the taxonomies we expect to see in our data down to the genus level (a sketch of how this classification could be reproduced follows the expected-taxonomy list below):
# Expected taxonomies
exp.1688 <- c("k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Bifidobacteriales;f__Bifidobacteriaceae;g__Bifidobacterium",
"k__Bacteria;p__Actinobacteria;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella",
"k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides",
"k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Porphyromonadaceae;g__Parabacteroides",
"k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Rikenellaceae;g__Alistipes",
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus",
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Tissierellaceae;g__Anaerococcus", #
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium",
"k__Bacteria;p__Firmicutes;c__Erysipelotrichi;o__Erysipelotrichales;f__Erysipelotrichaceae;g__Eubacterium", #
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae",
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Blautia",
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Clostridium",
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Dorea",
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Lachnospira", #
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Roseburia",
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae;g__Ruminococcus", #
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Anaerotruncus",
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Faecalibacterium",
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Ruminococcus",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Edwardsiella",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae", #
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia", #
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Pantoea",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Proteus",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Providencia",
"k__Bacteria;p__Verrucomicrobia;c__Verrucomicrobiae;o__Verrucomicrobiales;f__Verrucomicrobiaceae;g__Akkermansia")
And then compare the output of DADA2 to those expectations:
cat("DADA2 detected", sum(exp.1688 %in% genus.1688), "out of", length(exp.1688), "expected taxonomies.\n")
## DADA2 detected 25 out of 26 expected taxonomies.
cat("Missing taxonomy:", exp.1688[!exp.1688 %in% genus.1688], "\n")
## Missing taxonomy: k__Bacteria;p__Actinobacteria;c__Coriobacteriia;o__Coriobacteriales;f__Coriobacteriaceae;g__Collinsella
cat("Within the Bacteroides genus, DADA2 detected", sum(grepl("Bacteroides", tax.1688)), "sequence variants.")
## Within the Bacteroides genus, DADA2 detected 12 sequence variants.
DADA2 has detected essentially all the expected taxonomies. The one exception appears to be because no Collinsella intestinalis sequences made it into the dataset. Furthermore, DADA2 detected real variation at higher resolution than the genus level. For example, there were 12 Bacteroides strains among the 48 combined in this community, and DADA2 detected 12 Bacteroides sequence variants.
Now taking a closer look at the “false positives”, the sequences output by DADA2 that were not classified into an expected taxonomy:
unexp.1688 <- !genus.1688 %in% exp.1688
cat("DADA2 detected", sum(unexp.1688), "sequence variants with unexpected taxonomies with abundances:", clust.1688$abundance[unexp.1688], "\n")
## DADA2 detected 7 sequence variants with unexpected taxonomies with abundances: 743 141 93 11 11 23 81
uniquesToFasta(clust.1688[unexp.1688,], "unexpected_1688.fa")
# BLAST against nt on NCBI website
Every one of the 7 sequences from the 6 unexpected taxonomies is an exact (100% coverage and 100% identity) BLAST hit to nt. While unexpected, these are not false positives; they are correctly identified contaminants.
getF <- function(exp, genus) {
  TP <- sum(exp %in% genus)
  FP <- length(unique(genus[!genus %in% exp]))
  FN <- sum(!exp %in% genus)
  precision <- TP/(TP+FP)
  recall <- TP/(TP+FN)
  Fscore <- 2*precision*recall/(precision+recall)
  return(Fscore)
}
Following the accuracy methodology in the paper, DADA2 has an F-score of 0.877193 on this dataset, with all 6 “false positives” being “known”.
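That value can be reproduced directly with the getF helper defined above:
# Genus-level F-score for the Bokulich 1688 dataset.
getF(exp.1688, genus.1688)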
This mock community dataset (also Bokulich_2 in the paper) is a combination of 4 samples of the same mixture of 18 strains. It is the largest Bokulich dataset evaluated in this paper, at 6.9M 250 nt reads sequenced on a MiSeq. DADA2 expects different samples to be demultiplexed into individual fastq files, but this dataset is treated as one big sample in the paper, so we follow that lead and just process it all together.
We again start by removing the phiX contamination:
####### TO BE EXECUTED BY HAND IN YOUR SHELL
### awk '{ if (NR % 4 == 1) print ">",((NR+3)/4); else if (NR % 4 == 2) print $0; }' seqs.fastq > seqs.fasta
### blastn -query seqs.fasta -db phix -out phix_hits.out -max_target_seqs 1 -outfmt "6 qacc evalue"
### python rm_hits.py phix_hits.out seqs.fastq 1685_seqs_nophix.fastq
### gzip 1685_seqs_nophix.fastq
This reduces the number of sequences from 6,938,836 to 6,848,112. You can also download the phiX-free fastq.gz file here.
Inspecting the quality profile of the data:
fn.1685 <- "supplemental_otu_clustering_datasets/UPARSE_not_filtered_QIIME_label_format/16S/1685/1685_seqs_nophix.fastq.gz"
qq <- qa(fn.1685)[["perCycle"]]$quality
ShortRead:::.plotCycleQuality(qq, main="Bokulich 1685")
Quality scores drop sharply at the end of the sequences. We trim the last 20 bases and the first 10 bases (the first bases in Illumina sequencing are more likely to contain pathological errors):
filt.1685 <- "filt_1685.fastq.gz"
if(file.exists(filt.1685)) file.remove(filt.1685)
## [1] TRUE
fastqFilter(fn.1685, filt.1685, trimLeft=10, truncLen=230, truncQ=2, maxEE=2, maxN=0, verbose=TRUE)
## Read in 6848112 , outputted 5538338 filtered sequences.
Entering the DADA2 pipeline (this section involves some reasonably serious computation, and takes about 1.5 hours to complete on my laptop):
derep.1685 <- derepFastq(filt.1685, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: filt_1685.fastq.gz
## .....Encountered 264908 unique sequences from 5538338 total sequences read.
dada.1685 <- dada(derep.1685, err=inflateErr(tperr1,3), selfConsist=TRUE)
## Sample 1 - 5538338 reads in 264908 unique sequences.
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
##
##
## Convergence after 6 rounds.
bim.1685 <- isBimeraDenovo(dada.1685, verbose=TRUE)
## Identified 712 bimeras out of 887 input sequences.
clust.1685 <- dada.1685$clustering[!bim.1685,]
cat("DADA2 identified", nrow(clust.1685), "non-chimeric sequence variants.\n")
## DADA2 identified 175 non-chimeric sequence variants.
set.seed(100)
tax.1685 <- assignTaxonomy(clust.1685$sequence, "ggtrain_97.fa", minBoot=50)
genus.1685 <- justGenus(tax.1685)
While DADA2 identified far fewer variants than any of the tested methods, it still found significantly more than the 18 variants expected. We’ll return to that point, but for now we’ll follow the same procedure of taxonomically classifying and comparing to expectations as is done in the paper.
First, the list of expected taxonomies (from classifying the type sequences of the strains listed in Bokulich Supplementary Table 4 against GreenGenes):
exp.1685 <- c("k__Archaea;p__Euryarchaeota;c__Methanobacteria;o__Methanobacteriales;f__Methanobacteriaceae;g__Methanobrevibacter",
"k__Bacteria;p__[Thermi];c__Deinococci;o__Deinococcales;f__Deinococcaceae;g__Deinococcus",
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Actinomycetaceae;g__Actinomyces",
"k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Propionibacteriaceae;g__Propionibacterium",
"k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides",
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus",
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Listeriaceae;g__Listeria",
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus",
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Enterococcaceae;g__Enterococcus",
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus",
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus",
"k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Clostridiaceae;g__Clostridium",
"k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhodobacterales;f__Rhodobacteraceae;g__Rhodobacter",
"k__Bacteria;p__Proteobacteria;c__Betaproteobacteria;o__Neisseriales;f__Neisseriaceae;g__Neisseria",
"k__Bacteria;p__Proteobacteria;c__Epsilonproteobacteria;o__Campylobacterales;f__Helicobacteraceae;g__Helicobacter",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas")
Now we compare the taxonomies assigned to the variants identified by DADA2 to these expected taxonomies:
cat("DADA2 detected", sum(exp.1685 %in% genus.1685), "out of", length(exp.1685), "expected taxonomies.\n")
## DADA2 detected 18 out of 18 expected taxonomies.
cat(sum(clust.1685$abundance[genus.1685 %in% exp.1685]), "reads out of", sum(clust.1685$abundance), "total reads were in the expected taxonomies.\n")
## 5405676 reads out of 5406042 total reads were in the expected taxonomies.
Over 99.99% of total reads were assigned to the expected taxonomies. But let’s take a look at the remaining sequences:
unexp.1685 <- !genus.1685 %in% exp.1685
cat("DADA2 detected", sum(unexp.1685), "unexpected taxonomies with abundances:", clust.1685$abundance[unexp.1685], "\n")
## DADA2 detected 11 unexpected taxonomies with abundances: 143 96 81 18 7 5 4 3 4 3 2
print(genus.1685[unexp.1685])
## [1] "k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Xanthomonadales;f__Xanthomonadaceae;g__Stenotrophomonas"
## [2] "k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Hyphomicrobiaceae;g__Devosia"
## [3] "k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Serratia"
## [4] "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae"
## [5] "k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pasteurellales;f__Pasteurellaceae;g__Haemophilus"
## [6] "k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylobacteriaceae;g__Methylobacterium"
## [7] "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Paenibacillaceae;g__Brevibacillus"
## [8] "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardiaceae;g__Rhodococcus"
## [9] "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Ruminococcaceae;g__Oscillospira"
## [10] "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Corynebacteriaceae;g__Corynebacterium"
## [11] "k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Nocardioidaceae;g__Aeromicrobium"
uniquesToFasta(clust.1685[unexp.1685,], "unexpected_1685.fa")
# BLAST against nt on NCBI website
Eight of the 11 sequences from unexpected taxonomies were exact BLAST matches to nt, while the remaining 3 matched previously sequenced strains at >99% identity. Once again, these aren’t really false positives; they are correctly identified contaminants.
Following the accuracy methodology in the paper, DADA2 has an F-score of 0.7659574 on this dataset, with all 11 “false positives” being “known”.
At the level of genera, DADA2 very accurately reconstructed this mock community, but at the level of sequence variants the 175 sequences output by DADA2 are significantly more than the 18 strains expected (although much closer to the truth than the 7,000+ OTUs output by many of the other tested methods).
One of the nice things about DADA2 is that it provides some very useful diagnostics about each output sequence in the dadaObject$clustering data.frame. Inspecting the head of that data.frame (after removing chimeras):
head(clust.1685)
## sequence
## 1 GCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTG
## 2 GCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTATTAAGTCGGATGTGAAATCCCCGAGCTTAACTTGGGAATTGCATTCGATACTGGTGAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTA
## 3 CCGAGCGTTGTCCGGATTTATTGGGCGTAAAGGGAGCGCAGGCGGTCAGGAAAGTCTGGAGTAAAAGGCTATGGCTCAACCATAGTGTGCTCTGGAAACTGTCTGACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAAGCGGCTCTCTGGTCTGTCACTGACGCTGAGGCTC
## 4 GCTAGCGTTATTCGGAATTACTGGGCGTAAAGCGCACGTAGGCGGATCGGAAAGTCAGAGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTTTGAAACTCCCGATCTTGAGGTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTG
## 5 TCTAGTGGTAGCAGTTTTTATTGGGCCTAAAGCGTCCGTAGCCGGTTTAATAAGTCTCTGGTGAAATCCTGCAGCTTAACTGTGGGAATTGCTGGAGATACTATTAGACTTGAGATCGGGAGAGGTTAGAGGTACTCCCAGGGTAGAGGTGAAATTCTGTAATCCTGGGAGGACCGCCTGTTGCGAAGGCGTCTGACTGGAACGATTCTGACGGTGAGGG
## 6 GCAAGCGTTACCCGGAATCACTGGGCGTAAAGGGCGTGTAGGCGGAAATTTAAGTCTGGTTTTAAAGACCGGGGCTCAACCTCGGGGATGGACTGGATACTGGATTTCTTGACCTCTGGAGAGGTAACTGGAATTCCTGGTGTAGCGGTGGAATGCGTAGATACCAGGAGGAACACCAATGGCGAAGGCAAGTTACTGGACAGAAGGTGACGCTGAGGCG
## abundance n0 n1 nunq pval birth_type birth_pval birth_fold
## 1 3470621 2469608 816143 98048 NaN I NA NA
## 2 477476 350993 105973 18385 0 A 0 2.964138e+164
## 3 226740 164304 51792 10483 0 A 0 4.158736e+129
## 4 214808 150020 52492 11664 0 A 0 2.204744e+149
## 5 145966 108532 30482 6323 0 A 0 6.508727e+250
## 6 127239 87606 32193 6899 0 A 0 4.133789e+163
## birth_ham birth_qave
## 1 NA NA
## 2 47 35.15858
## 3 36 35.65817
## 4 43 34.08094
## 5 68 34.82169
## 6 49 33.53759
The sequence and abundance columns are self-explanatory, but several other columns are very useful in detecting spurious output sequences. In particular:

$n0: The number of reads exactly matching the output sequence.
$birth_fold: The fold over-abundance of the read relative to expectations under the error model.
$birth_ham: The hamming distance of this sequence from the cluster from which it was split off.
$birth_qave: The average quality score at the differences that caused this cluster to be split off.

Low values of n0 relative to total abundance, low values of birth_fold and birth_qave, and an overabundance of very low (typically just 1) birth_ham entries can all indicate spurious sequences.
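As a rough illustration of how these diagnostics can be combined (the thresholds below are ad hoc choices for this sketch, not DADA2 defaults):
# Flag output sequences that split off by a single nucleotide with only modest
# fold over-abundance; which() drops the NA entries of the most abundant sequence.
suspect <- which(clust.1685$birth_ham == 1 & clust.1685$birth_fold < 20)
length(suspect)
head(clust.1685[suspect, c("abundance", "n0", "birth_fold", "birth_ham", "birth_qave")])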
Noting that the $clustering data.frame is arranged in order of how confident DADA2 was in each sequence, we look at the bottom of the data.frame for clues on why we are getting too many output sequences:
tail(clust.1685[,-1], n=10)
## abundance n0 n1 nunq pval birth_type birth_pval
## 850 89 59 25 27 6.213046e-104 A 1.626157e-43
## 852 165 118 38 40 2.757087e-90 A 2.702259e-43
## 858 247 192 48 41 1.496014e-83 A 1.247807e-42
## 860 92 58 22 34 5.600672e-94 A 1.559984e-42
## 866 698 528 150 135 8.739117e-113 A 4.415625e-42
## 875 97 54 35 37 4.765873e-115 A 2.790611e-41
## 879 126 80 38 36 5.623624e-96 A 5.167375e-41
## 880 198 149 37 44 2.615952e-83 A 5.665123e-41
## 885 139 96 34 33 4.017443e-88 A 7.713226e-41
## 887 88 62 21 25 1.092467e-78 A 8.198294e-41
## birth_fold birth_ham birth_qave
## 850 20.771463 1 36.63462
## 852 5.713770 1 37.25641
## 858 3.575006 1 34.76963
## 860 15.840473 1 38.08621
## 866 2.022370 1 35.56381
## 875 19.500551 1 37.35294
## 879 8.699868 1 38.55000
## 880 4.315365 1 34.02721
## 885 6.861546 1 38.26316
## 887 13.021899 1 36.72581
table(clust.1685$birth_ham)
##
## 1 2 3 9 10 14 17 18 19 20 21 23 27 31 34 36 42 43
## 141 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1
## 45 47 48 49 50 51 53 57 58 68
## 2 1 1 1 1 1 2 2 1 1
This is a clear indication that, for this dataset, DADA2 is not effectively discriminating real single-nucleotide variants from errors. There are various reasons why this can happen, and DADA2 provides effective single-nucleotide resolution in many other datasets, but here it is not working.
So, we give up on having single-nucleotide resolution here, and re-cluster requiring a minimum hamming separation of 2 to call new sequence variants:
dada.1685.noSNV <- dada(derep.1685, err=dada.1685$err_out, MIN_HAMMING=2)
## No error function provided, no post-dada error estimates ($err_out) will be inferred.
## Sample 1 - 5538338 reads in 264908 unique sequences.
bim.1685.noSNV <- isBimeraDenovo(dada.1685.noSNV, verbose=TRUE)
## Identified 451 bimeras out of 490 input sequences.
clust.1685.noSNV <- dada.1685.noSNV$clustering[!bim.1685.noSNV,]
cat("DADA2 identified", nrow(clust.1685.noSNV), "non-chimeric sequence variants.\n")
## DADA2 identified 39 non-chimeric sequence variants.
uniquesToFasta(clust.1685.noSNV, "denoised.1685.noSNV.fasta")
# BLAST against nt on NCBI website
This is what we expect: the 18 expected strains and a handful of contaminants. Note that the 39 variants output by DADA2 here include every expected strain, including rare variants present at frequencies <0.001%, and that the sequences from unexpected genera almost all exactly match nt.
DADA2 was substantially more accurate than any of the methods tested in the paper, outperforming on both sensitivity and specificity.
For the even dataset, DADA2 detected every one of the 1,055 unique sequence variants, including those differing by just 1 nucleotide, more than any other method (957-1049). DADA2 reported 0 false positives (not reported for the other methods, but probably significant for some). DADA2’s estimated abundances were essentially perfect, and just 5/106,700 sequencing reads were misassigned.
For the staggered dataset, DADA2 identified 1,042/1,055 unique sequence variants, more than any other tested method (949-1035). DADA2 again output 0 false positives (not reported for the other methods, but probably significant for some). DADA2’s 13 false negatives were all very low abundance (1-5 reads, <0.01% frequency). DADA2’s estimated abundances were extremely accurate (cor > 0.9999).
Simulated data without chimeras is the easiest possible case, and performance on such data is not guaranteed to translate to real sequencing datasets. That said, none of the algorithms tested in this paper approached DADA2’s level of accuracy. Tested at the “OTU” level, DADA2’s F-score on the Even dataset is 1, and on the Staggered dataset is 0.994.
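For the staggered dataset, that F-score follows directly from the counts above (a quick arithmetic check):
# 1,042 true positives, 0 false positives, 13 false negatives.
TP <- 1042; FP <- 0; FN <- 13
precision <- TP/(TP+FP); recall <- TP/(TP+FN)
2*precision*recall/(precision+recall) # approximately 0.994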
For the Bokulich 1688 (or Bokulich_6) dataset, DADA2 output 64 sequence variants, closer to the expected number of 48 strains than any other method (many of which reported >1,000 OTUs). At the genus level, DADA2 identified 25/26 expected taxonomies, with the one missed taxonomy likely being absent from the sequencing data. DADA2 identified 6 unexpected genera, all of them low abundance (<0.5%) and all of them exact BLAST matches to nt; these are all likely correctly identified contaminants. DADA2’s F-score at the genus level was 0.877193, higher than any other tested method (0.45 - 0.61).
For the Bokulich 1685 (or Bokulich_2) dataset, DADA2 output 175 sequence variants, closer to the expected number of 18 strains than any other method. At the genus level, DADA2 identified 18/18 expected taxonomies. DADA2 identified 11 unexpected genera, all of them extremely low abundance (<0.01%) and all of them with a BLAST match of >99% identity to nt; these are all likely correctly identified contaminants. DADA2’s F-score at the genus level was 0.7659574, higher than any other tested method (0.39-0.54).
The default application of DADA2 was already more accurate than any of the other tested methods on Bokulich 1685, but with one extra step of using the provided diagnostics, and thereby deciding not to call single-nucleotide variants, DADA2 output 39 sequence variants, an almost error-free collection of the expected strains and contaminants. None of the algorithms tested in this paper approached this level of accuracy.
Common tools, including many of those tested in the paper Open-Source Sequence Clustering Methods Improve the State Of the Art, report thousands of OTUs from amplicon sequencing of mock communities containing just tens of strains, due to inadequate treatment of PCR and sequencing errors. This biases estimates of diversity, lowers the power of subsequent analysis, prevents the effective resolution of low-frequency variants, and limits amplicon studies to characterizing shifts in broad taxonomic groups.
As shown above, our method DADA2 improves on both the sensitivity and specificity of commonly used methods, detecting both more real variants and (far) fewer spurious ones. This difference can be dramatic: the common QIIME/uclust pipeline reports 20,084 OTUs in an 18-strain mock community, ~20,000 of which are spurious. In comparison, DADA2 run without single-nucleotide resolution reports just 39 sequence variants, which consist of the 18 expected strains, a handful of contaminants, and ~5 spurious sequences. Of note, this dramatic reduction in spurious output is achieved without imposing aggressive filters.
Accurate analyses start with accurate data, and DADA2 improves the accuracy of amplicon sequencing data by powerfully separating sequence errors from true biological diversity.