This past week a paper came out from a combination of the QIIME developers and the developers of some recent OTU clustering methods: Open-Source Sequence Clustering Methods Improve the State Of the Art. In this paper, several new and open source methods (Swarm, SUMACLUST, OTUCLUST) are compared to the standard methods in the field (QIIME/uclust, mothur, UPARSE).

As the developer of an open source method for inferring sample composition from amplicon sequencing, this presents an ideal opportunity to compare the accuracy our method, DADA2, to the “state of the art”.

To briefly recap, DADA2 is a method to infer the sample sequences in a collection of amplicon sequencing reads. DADA2 is a de novo method, and completely reference free. DADA2 does not create OTUs, it infers sample sequences exactly, resolving differences of as little as one nucleotide. DADA2 is an open-source method implemented as an R package available here.

This document is the compiled version of an R markdown document (available here) and all the code here can be replicated on your own computer.

Load packages and set working directory:

library(ShortRead); packageVersion("ShortRead")
## [1] '1.26.0'
library(dada2); packageVersion("dada2")
## [1] '0.10.3'

To run this R code yourself, install the DADA2 R package, download the supplementary data from the paper, unzip it, and set your R session’s working directory to the folder in which it was unzipped.

Simulated data

The first datasets considered in the paper are simulated datasets from a known mixture of template sequences. Errors are added using a software program that attempts to mimic the errors introduced by PCR amplification and Illumina sequencing. The GreenGenes code for the correct bacterial strain is included in the id line of the fastq file, thus we know the correct mapping for every read in these datasets.

Even abundances dataset

The first dataset considered in the paper is a simulated dataset, consisting of an even distribution (100 reads each) of 1,076 randomly selected GreenGenes 16S V4 gene segments.

This is basically the easiest possible problem for these methods: There are no chimeras, errors are simulated and won’t include pathological cases, and the distribution of abundances is completely even. The one difficulty is that these strains are separated from each other by as little as one nucleotide in some cases.

Removing records with ambiguous nucleotides

When generating this dataset, the authors left in several sequence templates from GreenGenes that included ambiguous nucleotides (eg. K, Y…) in the V4 region. This results in sequencing reads containing those ambiguous bases in the fastq files, which would never actually occur in Illumina sequencing. So we’re going to start by removing those records:

getGGIDs <- function(fn) { # Extracts the GG IDs from the provided fastq file
  srq <- readFastq(fn)
  ids <- sapply(strsplit(as.character(id(srq)), "\\s"), `[`, 2)
  ids <- sapply(strsplit(ids, "-"), `[`, 1)

fn.even <- "supplemental_otu_clustering_datasets/UPARSE_not_filtered_QIIME_label_format/16S/even/seqs.fastq"
ids <- getGGIDs(fn.even)
srq <- readFastq(fn.even)
seqs <- as.character(sread(srq))
no.ambig <- sapply(seqs, dada2:::C_check_ACGT)
##  111613 1141509  146992  158725  186887 3839317  512840  580633    7628 
##     100     100     100     100      99     100      99      99     100

There are 9 variants with ambiguous bases in the V4 region being used as templates. We remove them:

ambig.ids <- names(table(ids[!no.ambig]))
fn.even.noN <- "even.noN.fastq.gz"
if(file.exists(fn.even.noN)) file.remove(fn.even.noN)
## [1] TRUE
writeFastq(srq[!ids %in% ambig.ids], fn.even.noN, compress=TRUE, width=20000L)

There are now 1,067 green-genes strains in the dataset, each with 100 reads.

Running DADA2

This is simulated data, but its always good practice to look at the quality profile of the data before jumping in:

qq <- qa(fn.even.noN)[["perCycle"]]$quality
ShortRead:::.plotCycleQuality(qq, main="Sim - Even")

It looks very simulated. Typically we recommend trimming the first 10 bases and perhaps some at the end, as well as quality filtering based on expected errors. But in the interests of getting nice round numbers at the end, we’re going filter-free!

Entering the DADA2 pipeline (minus filtering and chimera-checking):

derep.even <- derepFastq(fn.even.noN, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: even.noN.fastq.gz
## Encountered 29105 unique sequences from 106700 total sequences read.
dada.even <- dada(derep.even, err=inflateErr(tperr1, 3), selfConsist=TRUE)
## Sample 1 - 106700 reads in 29105 unique sequences.
##    selfConsist step 2 
##    selfConsist step 3 
## Convergence after  3  rounds.
## dada-class: object describing DADA2 denoising results
## 1055 sample sequences were inferred from 29105 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, BAND_SIZE = 16, USE_QUALS = TRUE

The DADA2 pipeline inferred 1,055 sample sequences. This is more than any of the other algorithms detected (see Table 2 in the paper), but it’s less than the 1,067 strains that made up this data.

Evaluating DADA2’s output

Let’s take a look at the abundances of the inferred sample sequences.

##   98   99  100  101  102  200  301 
##    1    3 1037    2    1   10    1

First off, we see that we are getting the abundance estimates essentially exactly right. 1,037/1,055 inferred sample sequences have an inferred abundance of 100, exactly what is expected. And this table also tells us why we aren’t getting 1,067 variants: Several strains have identical 16S sequences over this region. There are 10 pairs of identical strains (the 200 abundance sequences) and 1 triple of identical strains (300 abundance). After subtracting out the 12 strains with duplicate sequences, we have 1,055 unique sequence variants in the dataset, all of which were detected by DADA2.

We can dig deeper by using the GreenGenes ID in the id line of of the fastq file to check whether the reads of each strain are all being assigned to the same output sequence.

assigned.even <- dada.even$map[derep.even$map]
ggids.even <- getGGIDs(fn.even.noN)
joint.table <- table(ggids.even, assigned.even)
joint.mat <- as.matrix(joint.table)
# Rows are GreenGenes IDs from the fastq id line
# Columns are the number of the output sequence to which the read was assigned
# Entries are the number of reads with that ggID and assignment
##    1    2 
## 1063    4

The assignment is almost perfect, in 1,063/1,067 cases, all reads with a given ID are assigned to the same output sequence by DADA2. We look more closely at the 4 remaining cases:

multi.assigned <- rowSums(joint.mat>0) > 1
joint.mat[multi.assigned, colSums(joint.mat[multi.assigned,]>0)>0]
##           assigned.even
## ggids.even 1 115 251 281 1048 1052 1053 1055
##    178767  0   2   0   0    0    0   98    0
##    185411  1   0   0   0    0    0    0   99
##    4461030 0   0   0  99    1    0    0    0
##    554668  0   0   1   0    0   99    0    0

Even these assignments are nearly perfect. In the first case 2 reads of GG strain 178767 (out of 100) are assigned to a different output sequence than the rest. In the other 3 cases just 1 read (out of 100) is differently assigned.

Staggered abundances dataset

This second dataset is simulated from the same set of GreenGenes 16S sequence templates, but now the abundances of the strains are allowed to vary in abundance from 1 to 197 reads.

We start by again removing the problematic sequence variants with ambiguous nucleotides:

fn.stag <- "supplemental_otu_clustering_datasets/UPARSE_not_filtered_QIIME_label_format/16S/staggered/seqs.fastq"
ids <- getGGIDs(fn.stag)
srq <- readFastq(fn.stag)

fn.stag.noN <- "stag.noN.fastq.gz"
if(file.exists(fn.stag.noN)) file.remove(fn.stag.noN)
## [1] TRUE
writeFastq(srq[!ids %in% ambig.ids], fn.stag.noN, compress=TRUE, width=20000L)

As before, there are now 1,067 green-genes strains in the dataset, representing 1,055 unique sequence variants.

Running DADA2

Entering the DADA2 pipeline (minus filtering and chimera-checking):

derep.stag <- derepFastq(fn.stag.noN, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: stag.noN.fastq.gz
## Encountered 28651 unique sequences from 106006 total sequences read.
dada.stag <- dada(derep.stag, err=inflateErr(tperr1, 3), selfConsist=TRUE)
## Sample 1 - 106006 reads in 28651 unique sequences.
##    selfConsist step 2 
##    selfConsist step 3 
## Convergence after  3  rounds.
## dada-class: object describing DADA2 denoising results
## 1042 sample sequences were inferred from 28651 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, BAND_SIZE = 16, USE_QUALS = TRUE

DADA2 only infers 1,042 output sequences this time (of the 1,055 unique variants in the dataset). This is higher than any of the methods tested in the paper, in part because the high-resolution of DADA2 allows it to differentiate very similar sequence variants.

Evaluating DADA2’s output

Let’s take a closer look at what was missed by using the fact that we know all the sequence variants from analyzing the even dataset. First we extract the GreenGenes IDs of the strains DADA2 missed in the staggered dataset:

all(dada.stag$clustering$sequence %in% dada.even$clustering$sequence)
## [1] TRUE
in.even.not.staggered <- which(!dada.even$clustering$sequence %in% dada.stag$clustering$sequence)
strain.lookup <- joint.mat[,in.even.not.staggered]
strain.lookup <- strain.lookup[rowSums(strain.lookup)>0,]
# strain.lookup # 1 strain per output sequence for these missed variants
missing.strains <- rownames(strain.lookup)
##  [1] "1577238" "178767"  "218785"  "258871"  "279281"  "4309513" "4322107"
##  [8] "4361688" "552844"  "553035"  "554668"  "749724"  "811456"

With the GGIDs of the missing strains in hand, let’s look at their presence in the staggered dataset:

ids.stag <- getGGIDs(fn.stag.noN)
## ids.stag
## 1577238  178767  218785  258871  279281 4309513 4322107 4361688  552844 
##       1       5       1       1       1       3       3       1       1 
##  553035  554668  749724  811456 
##       1       3       3       1

There are 13 strains missed in the staggered dataset. Eight of the missing strains contribute only one read to the dataset. DADA2 does not call singleton variants (and singleton OTUs were excluded in the paper) because of how hard it is to identify real biological singletons from errors. The remaining missed strains had just 3 or 5 reads.

While we don’t get easily interpretable round abundances here, we can still evaluate the accuracy of DADA2’s inferred abundances overall:

# Map between staggered assignments and even assignments by matching sequences
even.assign.of.stag.assign <- match(dada.stag$sequence, dada.even$sequence)
# Get the GGids of the unique variants corresponding to each corresponding even assignment
# This uses our knowledge that even assignments of >2 reads were all correct
ggids.of.corresponding.unique.variants <- lapply(even.assign.of.stag.assign, function(col) rownames(joint.mat)[(joint.mat>80)[,col]])
# Calculate the abundances of each unique variant from the fastq ID field
unique.variant.abundances <- sapply(ggids.of.corresponding.unique.variants, function(ids) sum(table(ids.stag)[ids]))
# Visualize
plot(dada.stag$clustering$abundance, unique.variant.abundances, xlab="DADA2 inferred abundance", ylab="True abundance")

cor(dada.stag$clustering$abundance, unique.variant.abundances)
## [1] 0.999986

DADA2’s inferred abundances are extremely accurate.

Bokulich Mock Communities

The second set of test datasets used in the paper are set of mock communities (i.e. mixtures of known strains) that were sequenced for the Bokulich et al. paper Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing.

Mock Community 1688

This mock community dataset (also Bokulich_6 in the paper) is a single sample of a mixture of 48 strains. It is the smallest Bokulich dataset, at just 251k 150nt reads sequences on a Miseq, but is also the most diverse.

Removing PhiX

It is not clear whether the phiX contamination in the Bokulich datasets were removed in the benchmarking paper, but we do so here regardless. For the unfamiliar, when performing Illumina amplicon sequencing, phiX DNA is added in order to maintain enough heterogeneity in base composition for the base-calling software to work. While the phiX DNA is often removed by the sequencing center, it is not uncommon for phiX sequences to remain, and you should always check for phiX before processing new amplicon sequencing data.

There are various ways to remove phiX contamination. My solution is to locally BLAST against the phiX genome, and remove any read which hit the phiX genome with a custom python script. My shell workflow is below. It requires a local installation of blast, the phix genome made into a blastdb, and this simple python script. These commands should be executed in the supplemental_otu_clustering_datasets/UPARSE_not_filtered_QIIME_label_format/16S/1688/ directory:

### awk '{ if (NR % 4 == 1) print ">",((NR+3)/4); else if (NR % 4 == 2) print $0; }' seqs.fastq > seqs.fasta
### blastn -query seqs.fasta -db phix -out phix_hits.out -max_target_seqs 1 -outfmt "6 qacc evalue"
### python phix_hits.out seqs.fastq 1688_seqs_nophix.fastq

This reduces the number of sequencing reads from 250,903 to 247,244. You can also download the phiX free fastq file here.

Quality Filtering

There are general guidelines for quality filtering (for example, we generally recommending filtering out reads with more than 2 expected errors) but one should always look at your data and modify those guidelines as necessary.

Visualizing the quality profile:

fn.1688 <- "supplemental_otu_clustering_datasets/UPARSE_not_filtered_QIIME_label_format/16S/1688/1688_seqs_nophix.fastq"
qq <- qa(fn.1688)[["perCycle"]]$quality
ShortRead:::.plotCycleQuality(qq, main="Bokulich 1688")