The dada function implements the high-resolution sample-inference at the core of the dada2 package. Because dada resolves amplicon sequence variants (ASVs) exactly, it is possible to analyze samples separately before combining them together in a final sequence table. However, the dada function also allows samples to be pooled together sample inference. Here we demonstrate that functionality, and discuss the pros and cons of pooling.


Getting ready

Load the necessary libraries:

library(dada2); packageVersion("dada2")
## [1] '1.11.4'

We’ll be working with the same data used in the tutorial, so grab that data if you don’t have it already and then set your working directory to its location:

path <- "~/MiSeq_SOP" # CHANGE ME to the directory containing the fastq files after unzipping.
fnFs <- list.files(path, pattern="_R1_001.fastq$", full.names=TRUE)
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

We’ll trim and filter just the forward reads for simplicity:

filtFs <- file.path(path, "filtered", paste0(sample.names, "_filtered.fq.gz"))
names(filtFs) <- sample.names
filterAndTrim(fnFs, filtFs, maxEE=2, truncLen=240, multithread=TRUE)

And we’ll learn the error rates from these samples:

err <- learnErrors(filtFs, multi=TRUE)
## 34016400 total bases in 141735 reads from 20 samples will be used for learning the error rates.

Pooling for sample inference

De novo OTU methods must pool samples before processing them, as without pooling the labels between samples are not consistent and cannot be compared, i.e. OTU1 in sample 1 and OTU1 sample 2 won’t be the same. DADA2 resolves amplicon sequence variants exactly, and because exact DNA sequences are consistent labels, samples can be processed independently by DADA2 and then combined, and this is the default behavior.

Independent sample processing has two major advantages: Computation time is linear in the number of samples, and memory requirements are flat with the number of samples. However, pooling allows information to be shared across samples, which makes it easier to resolve rare variants that were present as singletons or doubletone in one sample but were present many times across samples. Pooled sample inference is supported by calling dada(..., pool=TRUE). See also the pseudo-pooling approach introduced in version 1.8 that approximates full pooling in linear time.

Let’s start by processing our samples using the default sample-by-sample inference:

system.time(dd.sep <- dada(filtFs, err=err))
## Sample 1 - 7228 reads in 2014 unique sequences.
## Sample 2 - 5382 reads in 1667 unique sequences.
## Sample 3 - 5539 reads in 1501 unique sequences.
## Sample 4 - 2970 reads in 922 unique sequences.
## Sample 5 - 2977 reads in 955 unique sequences.
## Sample 6 - 4417 reads in 1314 unique sequences.
## Sample 7 - 6870 reads in 1799 unique sequences.
## Sample 8 - 4635 reads in 1460 unique sequences.
## Sample 9 - 15934 reads in 3680 unique sequences.
## Sample 10 - 11554 reads in 2803 unique sequences.
## Sample 11 - 12171 reads in 3079 unique sequences.
## Sample 12 - 5100 reads in 1590 unique sequences.
## Sample 13 - 18312 reads in 3772 unique sequences.
## Sample 14 - 6329 reads in 1502 unique sequences.
## Sample 15 - 4116 reads in 1213 unique sequences.
## Sample 16 - 7480 reads in 1859 unique sequences.
## Sample 17 - 4823 reads in 1207 unique sequences.
## Sample 18 - 4935 reads in 1401 unique sequences.
## Sample 19 - 6586 reads in 1739 unique sequences.
## Sample 20 - 4377 reads in 920 unique sequences.
##    user  system elapsed 
##  20.927   0.618  21.699

Now we pool the samples for sample inference:

system.time(dd.pool <- dada(filtFs, err=err, pool=TRUE))
## 20 samples were pooled: 141735 reads in 23616 unique sequences.
##    user  system elapsed 
##  39.847   0.714  40.619

The pooled sample inference took longer, about twice as long in this case, but the delta between independent and pooled processing grows as study size increases. In practice, pooled processing can be used for Miseq scale data (especially if taking advantage of multithreading) but sample-by-sample processing remains computationally tractable out to any study size (eg. 20 Hiseq runs).

We’ll also take a quick look at the differences in output:

st <- makeSequenceTable(dd.sep)
st.sep <- removeBimeraDenovo(st, multithread=TRUE)

st <- makeSequenceTable(dd.pool)
st.pool <- removeBimeraDenovo(st, multithread=TRUE)

dim(st.sep); dim(st.pool)
## [1]  20 279
## [1]  20 301

As expected, pooling detected a few more amplicon sequence variants (ASVs) due to an increased power to detect rare variants.

sq.sep <- getSequences(st.sep)
sq.pool <- getSequences(st.pool)
sum(!sq.sep %in% sq.pool);sum(sq.sep %in% sq.pool); sum(!sq.pool %in% sq.sep)
## [1] 21
## [1] 258
## [1] 43
sum(st.sep[,!sq.sep %in% sq.pool])
## [1] 198
sum(st.sep[,sq.sep %in% sq.pool])
## [1] 132666
sum(st.pool[,sq.pool %in% sq.sep])
## [1] 127450
sum(st.pool[,!sq.pool %in% sq.sep])
## [1] 1377

The large majority of ASVs, and the vast majority of reads (>99%) are commonly assigned between these methods. Either method provides an accurate reconstruction of your sampled communities.


Maintained by Benjamin Callahan (benjamin DOT j DOT callahan AT gmail DOT com)
Documentation License: CC-BY 4.0