How can I demultiplex my fastq files?

The dada2 workflow assumes that you are starting with demultiplexed fastq files. That is, there is one fastq (or one forward and one reverse) for each sample. This is commonly the format in which sequencing data is received from sequencing centers, but especially when using single-index barcoding it is also fairly common to receive un-demultiplexed data alongside an “index” fastq.

There are multiple methods to demultiplex such data. We use the idemp tool. The fastx-toolkit performs demultiplexing.

You can also demultiplex using QIIME1 by running split_libraries_fastq.py followed by split_sequence_file_on_sample_ids.py. If using QIIME1 to demultiplex paired-end data, we recommend turning off filtering as the QIIME filtering causes the forward/reverse reads to be in mismatched order. You can do this by passing split_libraries_fastq.py the following arguments: -r 999 -n 999 -q 0 -p 0.0001. The QIIME2 platform also supports demultiplexing for the EMP indexing format.

If primers are at the start of your reads and are a constant length (a common scenario) you can use the trimLeft = c(FWD_PRIMER_LEN, REV_PRIMER_LEN) argument of dada2’s filtering functions to remove the primers.

For more complex situations (e.g. heterogenity spacers, variable length amplicons such as the ITS region) we recommend the cutadapt tool or the trimmomatic tool. You can see a worked example, including verification of primer removal, in the ITS-specific version of the DADA2 workflow.

Please double-check that your primers have been removed! The ambiguous nucleotides in unremoved primers will interfere with the dada2 pipeline.

What if my forward and reverse reads aren’t in matching order?

This situation commonly arises when external filtering methods, like the QIIME demultiplexing script, are used and filter the forward and reverse reads independently. This can be remedied by using adding matchIDs=TRUE flag to the filterAndTrim or fastqPairedFilter functions. For example, if no more filtering is required, the following will retain just those reads that match between the forward and reverse fastq files (assumes Illumina fastq headers):

filterAndTrim(..., matchIDs=TRUE)

Can I use dada2 with my 454 or Ion Torrent data?

Yes. We recommend the following parameters for denoising pyrosequencing data (like IT and 454):

dada(..., HOMOPOLYMER_GAP_PENALTY=-1, BAND_SIZE=32)

We recommend additional (relative to the tutorial) filtering of 454 sequences by maximum length:

filterAndTrim(..., maxLen=XXX) # XXX depends on the chemistry

We recommend additional (relative to the tutorial) trimming of Ion Torrent sequences:

filterAndTrim(..., trimLeft=15)

Yes, although we strongly recommend using primers that allow your paired-end reads to overlap significantly.

The mergePairs(..., justConcatenate=TRUE) option allows the paired reads to be joined without any overlap, but with 10Ns inserted in between the forward and reverse reads. The chimera removal and assignTaxonomy functions will handle such merged reads, although some other functions may fail (eg. addSpecies).

What if the self-consistency loop terminated before convergence?

In almost all cases, just proceed with the learned error rates. While in most cases convergence is reached within the MAX_CONSIST=10 rounds of self-consistent looping allowed by default, the error model is suitable to use as long as it is close to convergence. The estimation of the error rates is important primarily because of the huge difference in error rates that can exist between different runs and sequencing technologies. The small fluctuations in slow-converging datasets are minor relative to those major differences, and are typically smaller than the uncertainty in the estimation methods anyway.

For those who want to be sure the non-converged error model is reasonable, there are two ways to inspect the error rates and diagnose any potential issues. The first is to use plotErrors(err, nominalQ=TRUE). Do the fitted error rates (black line) reasonably fit the observations (black points) and generally decrease with increasing Q? Second, inspect the output of dada2:::checkConvergence(err), which gives the sum of absolute differences between error rate parameters at each step. Are the final values much lower (multiple orders of magnitude) than the first two values? If the answers to both questions are yes, then it is appropriate to use the error rates as is. The perfectionist can try increasing the allowed number of self-consistency steps with e.g. learnErrors(..., MAX_CONSIST=20) and see if the error model improves further.

Why aren’t any of my reads being successfully merged?

Do your reads overlap after the trimming and filtering you performed? In the tutorial we analyzed 2x250 Miseq reads using the 515F and 805R primers, meaning the reads strongly overlap and we could trim as dictated by the quality scores. However, if you are using a less-overlapping sequencing/primer-set combination, you must retain enough sequence for the reads to healthily overlap (mergePairs requires 20 nts of overlap by default).

Another possibility is that you have bad data. One quick way to test this is to simply look at the forward reads alone. If it’s 16S data, does assignTaxonomy identify any of the, say, top 20 output sequences? If not, or if you are using non-16S data, try BLAST-ing the top output sequences against nt. Are they hitting what you expect? If not, you probably have a bad sequencing run on your hands.

Why are so many of my reads being removed as chimeras?

It is common for chimeric sequences to be a majority (even large majority) of the total number of unique sequence variants inferred by the dada(...) algorithm, but they should not be a majority of the sequencing reads. That is, there may be many chimeras, but they are relatively low abundance.

The most common reason that far too many reads are flagged as chimeric is that primer sequences were not removed prior to starting the dada2 workflow. The ambiguous nucleotides in universal primer sequences will be interpreted as real variation by the dada2 pipeline, and this interferes with the chimera algorithm. In most cases, if you see over 25% of your sequencing reads being flagged chimeric, remove the primers from your reads and restart the workflow with the clean data.