The dada function implements the high-resolution sample-inference at the core of the dada2 package. Because dada resolves amplicon sequence variants (ASVs) exactly, it is possible to analyze samples separately before combining them together in a final sequence table. However, the dada function also allows samples to be pooled together sample inference. Here we demonstrate that functionality, and discuss the pros and cons of pooling.

Getting ready

Load the necessary libraries:

library(dada2); packageVersion("dada2")
## [1] '1.8.0'

We’ll be working with the same data used in the tutorial, so grab that data if you don’t have it already and then set your working directory to its location:

path <- "~/MiSeq_SOP" # CHANGE ME to the directory containing the fastq files after unzipping.
fnFs <- list.files(path, pattern="_R1_001.fastq$", full.names=TRUE)
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

We’ll trim and filter just the forward reads for simplicity:

filtFs <- file.path(path, "filtered", paste0(sample.names, "_filtered.fq.gz"))
filterAndTrim(fnFs, filtFs, maxEE=2, truncLen=240, multithread=TRUE)

Dereplicate and name our samples:

drps <- derepFastq(filtFs)
names(drps) <- sample.names


Pooling for error rate estimation

The first type of pooling we’ll consider is pooling samples for error rate estimation. This is the default functionality of the learnErrors function when passed a list of samples, but for comparison purposes we’ll first look at what happens if we estimate error rates using just one sample alone.

err1 <- learnErrors(drps[[1]], multithread=TRUE)
## 1734720 total bases in 7228 reads from 1 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
## Convergence after  5  rounds.
plotErrors(err1, nominalQ=TRUE)

The two things to look for in the output of plotErrors are that the observed error rates (black dots) are reasonably well fit by the modeled error rates (black line), and that the error rates are sensible, in particular that they are mostly decreasing with quality score.

Now let’s estimate error rates using all the samples:

err <- learnErrors(drps, multithread=TRUE)
## 34016400 total bases in 141735 reads from 20 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 ....................
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
##    selfConsist step 6
## Convergence after  6  rounds.
plotErrors(err, nominalQ=TRUE)

The difference is not large, but the fit between the observed points and the modeled line has clearly improved by pooling together more data.

There are two additional things to note here. First, it is not advised to pool samples that don’t share an “error history”, in particular samples that come from different sequencing runs or different PCR protocols. Samples from different runs should typically be run through the dada function separately, so that the correct run-specific error rates can be learned.

Second, it is generally not necessary to estimate error rates across an entire sequencing run. Because error rate estimation requires multiple loops through the core sample inference algorithm, it increases compute time significantly. Therefore it is often desirable to estimate error rates on a subset of the samples, and then use those error rates to process the rest of the samples. This subset is controlled by the learnErrors(..., nbases=1e8) parameter.


Pooling for sample inference

De novo OTU methods must pool samples before processing them, as without pooling the labels between samples are not consistent and cannot be compared, i.e. OTU1 in sample 1 and OTU1 sample 2 won’t be the same. DADA2 resolves sequence variants exactly, and because DNA sequences are consistent labels samples can be processed independently and then combined, and this is the default behavior.

Independent sample processing has two major advantages: Computation time is linear in the number of samples, and memory requirements are flat with the number of samples. However, pooling allows information to be shared across samples, which makes it easier to resolve rare variants that were seen just once or twice in one sample but many times across samples. Pooled sample inference is also supported by calling dada(..., pool=TRUE). See also the pseudo-pooling approach introduced in version 1.8 that approximates full pooling in linear time.

Let’s start by processing our samples using the default sample-by-sample inference:

system.time(dd.sep <- dada(drps, err=err))
## Sample 1 - 7228 reads in 2014 unique sequences.
## Sample 2 - 5382 reads in 1667 unique sequences.
## Sample 3 - 5539 reads in 1501 unique sequences.
## Sample 4 - 2970 reads in 922 unique sequences.
## Sample 5 - 2977 reads in 955 unique sequences.
## Sample 6 - 4417 reads in 1314 unique sequences.
## Sample 7 - 6870 reads in 1799 unique sequences.
## Sample 8 - 4635 reads in 1460 unique sequences.
## Sample 9 - 15934 reads in 3680 unique sequences.
## Sample 10 - 11554 reads in 2803 unique sequences.
## Sample 11 - 12171 reads in 3079 unique sequences.
## Sample 12 - 5100 reads in 1590 unique sequences.
## Sample 13 - 18312 reads in 3772 unique sequences.
## Sample 14 - 6329 reads in 1502 unique sequences.
## Sample 15 - 4116 reads in 1213 unique sequences.
## Sample 16 - 7480 reads in 1859 unique sequences.
## Sample 17 - 4823 reads in 1207 unique sequences.
## Sample 18 - 4935 reads in 1401 unique sequences.
## Sample 19 - 6586 reads in 1739 unique sequences.
## Sample 20 - 4377 reads in 920 unique sequences.
##    user  system elapsed 
##  18.610   0.157  18.814

Now we pool the samples for sample inference:

system.time(dd.pool <- dada(drps, err=err, pool=TRUE))
## 20 samples were pooled: 141735 reads in 23616 unique sequences.
##    user  system elapsed 
##  39.288   0.421  40.000

The pooled sample inference took longer, about twice as long in this case, but the delta between independent and pooled processing grows as study size increases. In practice, pooled processing can be used for Miseq scale data (especially if taking advantage of multithreading) but sample-by-sample processing remains computationally tractable out to any study size (eg. 20 Hiseq runs).

We’ll also take a quick look at the differences in output:

st <- makeSequenceTable(dd.sep)
st.sep <- removeBimeraDenovo(st, multithread=TRUE)

st <- makeSequenceTable(dd.pool)
st.pool <- removeBimeraDenovo(st, multithread=TRUE)

dim(st.sep); dim(st.pool)
## [1]  20 279
## [1]  20 301

As expected, pooling detected a few more amplicon sequence variants (ASVs) due to an increased power to detect rare variants.

sq.sep <- getSequences(st.sep)
sq.pool <- getSequences(st.pool)
sum(!sq.sep %in% sq.pool);sum(sq.sep %in% sq.pool); sum(!sq.pool %in% sq.sep)
## [1] 21
## [1] 258
## [1] 43
sum(st.sep[,!sq.sep %in% sq.pool])
## [1] 198
sum(st.sep[,sq.sep %in% sq.pool])
## [1] 132666
sum(st.pool[,sq.pool %in% sq.sep])
## [1] 127450
sum(st.pool[,!sq.pool %in% sq.sep])
## [1] 1377

The large majority of ASVs, and the vast majority of reads (>99% here) are commonly assigned between these methods. Either provides an accurate reconstruction of your sampled communities.

Maintained by Benjamin Callahan (benjamin DOT j DOT callahan AT gmail DOT com)