A DADA2 workflow for Big Data: Paired-end (1.4 or later)

More discussion of the computational strategy of the big data workflow is available in the single-read version.

Starting point

The workflow below expects demultiplexed, per-sample, gzipped fastq files for the primer-free forward reads of a Hiseq run to be in the directory pathF (defined below), and the corresponding primer-free reverse read files to be in a different directory pathR. The string parsing below expects forward and reverse filenames of the following format: samplename_XXX.fastq.gz, but other formats simply require modification to the filename parsing portions of these scripts.

NOTE: The filename parsing in this workflow expects the paired forward and reverse fastq files to have the same samplename_ prefix.

Filter

Please consider the filtering parameters, one set of parameters is not optimal for all datasets.

Filtering script:

library(dada2); packageVersion("dada2")
# File parsing
pathF <- "path/to/FWD" # CHANGE ME to the directory containing your demultiplexed forward-read fastqs
pathR <- "path/to/REV" # CHANGE ME ...
filtpathF <- file.path(pathF, "filtered") # Filtered forward files go into the pathF/filtered/ subdirectory
filtpathR <- file.path(pathR, "filtered") # ...
fastqFs <- sort(list.files(pathF, pattern="fastq.gz"))
fastqRs <- sort(list.files(pathR, pattern="fastq.gz"))
if(length(fastqFs) != length(fastqRs)) stop("Forward and reverse files do not match.")
# Filtering: THESE PARAMETERS ARENT OPTIMAL FOR ALL DATASETS
filterAndTrim(fwd=file.path(pathF, fastqFs), filt=file.path(filtpathF, fastqFs),
              rev=file.path(pathR, fastqRs), filt.rev=file.path(filtpathR, fastqRs),
              truncLen=c(240,200), maxEE=2, truncQ=11, maxN=0, rm.phix=TRUE,
              compress=TRUE, verbose=TRUE, multithread=TRUE)

Infer Sequence Variants

This portion of the workflow should be performed on a run-by-run basis, as error rates can differ between runs.

Sample inference script:

library(dada2); packageVersion("dada2")
# File parsing
filtpathF <- "path/to/FWD/filtered" # CHANGE ME to the directory containing your filtered forward fastqs
filtpathR <- "path/to/FWD/filtered" # CHANGE ME ...
filtFs <- list.files(filtpathF, pattern="fastq.gz", full.names = TRUE)
filtRs <- list.files(filtpathR, pattern="fastq.gz", full.names = TRUE)
sample.names <- sapply(strsplit(basename(filtFs), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
sample.namesR <- sapply(strsplit(basename(filtRs), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
if(!identical(sample.names, sample.namesR)) stop("Forward and reverse files do not match.")
names(filtFs) <- sample.names
names(filtRs) <- sample.names
set.seed(100)
# Learn forward error rates
errF <- learnErrors(filtFs, nbases=1e8, multithread=TRUE)
# Learn reverse error rates
errR <- learnErrors(filtRs, nbases=1e8, multithread=TRUE)
# Sample inference and merger of paired-end reads
mergers <- vector("list", length(sample.names))
names(mergers) <- sample.names
for(sam in sample.names) {
  cat("Processing:", sam, "\n")
    derepF <- derepFastq(filtFs[[sam]])
    ddF <- dada(derepF, err=errF, multithread=TRUE)
    derepR <- derepFastq(filtRs[[sam]])
    ddR <- dada(derepR, err=errR, multithread=TRUE)
    merger <- mergePairs(ddF, derepF, ddR, derepR)
    mergers[[sam]] <- merger
}
rm(derepF); rm(derepR)
# Construct sequence table and remove chimeras
seqtab <- makeSequenceTable(mergers)
saveRDS(seqtab, "path/to/run1/output/seqtab.rds") # CHANGE ME to where you want sequence table saved

The final result, the count matrix of samples (rows) by non-chimeric sequence variants (columns), is stored as as serialized R object. Read it back into R with foo <- readRDS("path/to/run1/output/seqtab.rds").

Merge Runs, Remove Chimeras, Assign Taxonomy

This part is the same as in the single-read version of this workflow.

Chimera/Taxonomy script:

library(dada2); packageVersion("dada2")
# Merge multiple runs (if necessary)
st1 <- readRDS("path/to/run1/output/seqtab.rds")
st2 <- readRDS("path/to/run2/output/seqtab.rds")
st3 <- readRDS("path/to/run3/output/seqtab.rds")
st.all <- mergeSequenceTables(st1, st2, st3)
# Remove chimeras
seqtab <- removeBimeraDenovo(st.all, method="consensus", multithread=TRUE)
# Assign taxonomy
tax <- assignTaxonomy(seqtab, "path/to/silva_nr_v128_train_set.fa.gz", multithread=TRUE)
# Write to disk
saveRDS(seqtab, "path/to/study/seqtab_final.rds") # CHANGE ME to where you want sequence table saved
saveRDS(tax, "path/to/study/tax_final.rds") # CHANGE ME ...

Maintained by Benjamin Callahan (benjamin DOT j DOT callahan AT gmail DOT com)