1 Identifying contaminants in marker-gene and metagenomics data

Introduction to the decontam R package. User support, feature requests, comments and suggestions are welcome at the decontam issues tracker. The preprint introducing decontam includes performance benchmarking, and demonstrates the benefits of decontam-inating your data for more accurate profiling of microbial communities.

2 Introduction

The investigation of environmental microbial communities and microbiomes has been transformed by the recent widespread adoption of culture-free high-throughput sequencing methods. In amplicon sequencing a particular genetic locus is amplified from DNA extracted from the community of interest, and then sequenced on a next-generation sequencing platform. In shotgun metagenomics, bulk DNA is extracted from the community of interest and sequenced. Both techniques provide cost-effective and culture-free characterizatoins of microbial communities.

However, the accuracy of these methods is limited in practice by the introduction of contaminating DNA that was not truly present in the sampled community. This contaminating DNA can come from several sources, such as the reagents used in the sequencing reaction, and can critically interfere with downstream analyses, especially in lower biomass environments. The decontam package provides simple statistical methods to identify and visualize contaminating DNA features, allowing them to be removed and a more accurate picture of sampled communities to be constructed from marker-gene and metagenomics data.

3 Necessary Ingredients

The first decontam ingredient is a feature table derived from your raw data, i.e. a table of the relative abundances of sequence features (columns) in each sample (rows). These “sequence features” can be any of a wide variety of feature types, including amplicon sequence variants (ASVs), operational taxonomic units (OTUs), taxonomic groups or phylotypes (e.g. genera), orthologous genes or metagenome-assembled-genomes (MAGs) – anything with a quantitative abundance derived from marker-gene or metagenomics data.

The second decontam ingredient is one of two types of metadata:

  1. DNA quantitation data recording the concentration of DNA in each sample. Most often this is collected in the form of fluorescent intensities measured prior to mixing samples into equimolar ratios for sequencing (and after PCR in amplicon sequencing), but sometimes it is collected via qPCR or other approach on extracted DNA.

  2. A defined set of “negative control” samples in which sequencing was performed on blanks without any biological sample added. Extraction controls are preferred, and in amplicon sequencing the negative controls should also be carried through the PCR step, as each step in the workflow has the potential to introduce new contaminants.

Finally, this data needs to be imported into R. The feature table should be imported as a sample-by-feature matrix and the sample metadata as a vector with length the number of samples. Alternatively, you can use the phyloseq package to import the data as a phyloseq object.

4 Setting up

The decontam package works with feature tables in the form of a standard R matrix, but it is even easier to work with phyloseq objects from the phyloseq package, which is designed to ease the analysis of marker-gene and metagenomics datasets.

In this introductory tutorial, we’ll start by reading in a phyloseq object to work with:

library(phyloseq); packageVersion("phyloseq")
## [1] '1.25.2'
library(ggplot2); packageVersion("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.4.4
## [1] '3.0.0'
library(decontam); packageVersion("decontam")
## [1] '1.1.1'
ps <- readRDS(system.file("extdata", "MUClite.rds", package="decontam"))
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1951 taxa and 569 samples ]
## sample_data() Sample Data:       [ 569 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 1951 taxa by 6 taxonomic ranks ]

This phyloseq objects has a table of 1951 amplicon sequence variants (ASVs) inferred by the DADA2 algorithm from amplicon sequencing data of the V4 region of the 16S rRNA gene. The phyloseq objects also includes the sample metadata information needed to use decontam.

##                    X.SampleID PlateNumber Subject     Habitat quant_reading
## P1101C01701R00 P1101C01701R00           3    1101      Tongue          2829
## P1101C01702R00 P1101C01702R00           3    1101 Cheek_Right          5808
## P1101C01703R00 P1101C01703R00           3    1101  Cheek_Left          5052
## P1101C08701R00 P1101C08701R00           3    1101      Tongue          1227
## P1101C08702R00 P1101C08702R00           3    1101 Cheek_Right          3872
## P1101C08703R00 P1101C08703R00           3    1101  Cheek_Left          5872
##                Sample_or_Control
## P1101C01701R00       True Sample
## P1101C01702R00       True Sample
## P1101C01703R00       True Sample
## P1101C08701R00       True Sample
## P1101C08702R00       True Sample
## P1101C08703R00       True Sample

The key sample variables are quant_reading, which is the DNA concentration in each sample as measured by fluorescent intensity, and Sample_or_Control which tells us which samples are the negative controls.

5 Inspect Library Sizes

Let’s take a quick first look at the library sizes (i.e. the number of reads) in each sample, as a function of whether that sample was a true positive sample or a negative control:

df <- # Put sample_data into a ggplot-friendly data.frame
df$LibrarySize <- sample_sums(ps)
df <- df[order(df$LibrarySize),]
df$Index <- seq(nrow(df))
ggplot(data=df, aes(x=Index, y=LibrarySize, color=Sample_or_Control)) + geom_point()

The library sizes of the positive samples primarily fall from 15,000 to 40,000 reads, but there are some low-read outliers. The negative control samples have fewer reads as expected. Note: It is important keep the low-read samples for now, because we want to use those negative controls to help identify contaminants!

6 Identify Contaminants - Frequency

The first contaminant identification method we’ll use is the “frequency” method. In this method, the distribution of the frequency of each sequence feature as a function of the input DNA concentration is used to identify contaminants.

In our phyloseq object, "quant_reading" is the sample variable that holds the concentration information:

contamdf.freq <- isContaminant(ps, method="frequency", conc="quant_reading")
##             freq prev       p.freq p.prev            p contaminant
## Seq1 0.323002694  549 1.000000e+00     NA 1.000000e+00       FALSE
## Seq2 0.098667396  538 1.000000e+00     NA 1.000000e+00       FALSE
## Seq3 0.003551358  160 1.135975e-18     NA 1.135975e-18        TRUE
## Seq4 0.067588419  519 9.999998e-01     NA 9.999998e-01       FALSE
## Seq5 0.045174743  354 1.000000e+00     NA 1.000000e+00       FALSE
## Seq6 0.040417101  538 1.000000e+00     NA 1.000000e+00       FALSE

This calculation has returned a data.frame with several columns, the most important being $p which containts the probability that was used for classifying contaminants, and $contaminant which contains TRUE/FALSE classification values with TRUE indicating that the statistical evidence that the associated sequence feature is a contaminant exceeds the user-settable threshold. As we did not specify the threshold, the default value of threshold = 0.1 was used, and $contaminant=TRUE if $p < 0.1.

##  1893    58
## [1]   3  30  53  80 152 175

Just 58 out of the 1901 ASVs were classified as contaminants, but this includes some very abundant sequences, including the third (!) most abundant ASV.

Let’s take a look at what a clear non-contaminant (the 1st ASV), and a clear contaminant (the 3rd ASV), look like:

plot_frequency(ps, taxa_names(ps)[c(1,3)], conc="quant_reading") + 
  xlab("DNA Concentration (PicoGreen fluorescent intensity)")