Data

These data are from a set of 12 fecal samples collected from 6 subjects. Two of the subjects gave multiple longitudinal samples at 1-6 month intervals.

The data were processed from extracted gDNA by PacBio using their 16S protocol and the Sequel sequencing instrument. Replicate 1 was sequenced using the S/P2-C2/5.0 sequencing chemistry, and Replicate 2 was sequenced with a pre-release version of the S/P3-C3/5.0 sequencing chemistry.

CCS reads were constructed using PacBio's software with default parameters (including minPasses of 3) and a minPredictedAccuracy of 99.9%.

Setup

Load libraries, setup paths, prepare environment:

library(dada2);packageVersion("dada2")
## [1] '1.12.1'
library(Biostrings); packageVersion("Biostrings")
## [1] '2.46.0'
library(ShortRead); packageVersion("ShortRead")
## [1] '1.36.1'
library(ggplot2); packageVersion("ggplot2")
## [1] '3.1.0'
library(reshape2); packageVersion("reshape2")
## [1] '1.4.3'
library(gridExtra); packageVersion("gridExtra")
## [1] '2.3'
library(phyloseq); packageVersion("phyloseq")
## [1] '1.25.2'
path1 <- "~/Desktop/LRAS/Data/Fecal1" # CHANGE ME to location of the First Replicate fastq files
path2 <- "~/Desktop/LRAS/Data/Fecal2" # CHANGE ME to location of the Second Replicate fastq files
path.out <- "Figures/"
path.rds <- "RDS/"
fns1 <- list.files(path1, pattern="fastq.gz", full.names=TRUE)
fns2 <- list.files(path2, pattern="fastq.gz", full.names=TRUE)
F27 <- "AGRGTTYGATYMTGGCTCAG" # forward primer (27F)
R1492 <- "RGYTACCTTGTTACGACTT" # reverse primer (1492R)
rc <- dada2:::rc # convenience handle to the reverse-complement function
theme_set(theme_bw())
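
Because the two data paths are machine-specific (the CHANGE ME comments above) and the output directories may not exist yet, a small optional check, not part of the original workflow, can save some confusion later:

# Optional sanity check: fail early if no CCS fastq files were found, and create output dirs
if(length(fns1) == 0 || length(fns2) == 0) stop("No fastq.gz files found; check path1/path2.")
if(!dir.exists(path.out)) dir.create(path.out)
if(!dir.exists(path.rds)) dir.create(path.rds)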

Process Replicate 2

We will start by processing Replicate 2, the higher-depth replicate, which we will use to explore the E. coli ASVs in these data.

Remove Primers and Filter

Remove primers and orient reads:

nops2 <- file.path(path2, "noprimers", basename(fns2))
prim2 <- removePrimers(fns2, nops2, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE)

The loss here is higher than in the mock communities, but still very reasonable. That is probably not surprising: these are real samples rather than mock DNA, and the DNA extraction was done by CoreBiome without special care taken to preserve high-molecular-weight DNA.
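
To put a number on that loss, the matrix returned by removePrimers records per-file read counts, so the retained fraction can be computed directly (a quick optional check; this assumes, as for filterAndTrim, that the first column holds input read counts and the second holds output read counts):

# Fraction of reads surviving primer removal and orientation, per sample
# (assumes column 1 = input read counts, column 2 = output read counts)
round(prim2[, 2] / prim2[, 1], 3)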

Inspect the length distribution:

lens.fn <- lapply(nops2, function(fn) nchar(getSequences(fn)))
lens <- do.call(c, lens.fn)
hist(lens, 100)

There is a very strong peak at ~1450 nt, consistent with full-length 16S sequences. Looks good.
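
For a more quantitative look than the histogram, a couple of optional summary lines tell the same story (the 1400-1600 nt window used here is just an illustrative range around the expected full-length product):

# Numerical summary of read lengths after primer removal
summary(lens)
# Fraction of reads in an (illustrative) 1400-1600 nt window around the expected length
mean(lens > 1400 & lens < 1600)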

Filter:

filts2 <- file.path(path2, "noprimers", "filtered", basename(fns2))
track2 <- filterAndTrim(nops2, filts2, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2)
track2
##                                      reads.in reads.out
## Callahan_16S_R_11.1.ccs99.9.fastq.gz    17436     17147
## Callahan_16S_R_3.1.ccs99.9.fastq.gz     26605     26190
## Callahan_16S_R_3.2.ccs99.9.fastq.gz     24477     24081
## Callahan_16S_R_3.3.ccs99.9.fastq.gz     16738     16474
## Callahan_16S_R_9.1.ccs99.9.fastq.gz     28623     28171
## Callahan_16S_R_9.1B.ccs99.9.fastq.gz    24832     24453
## Callahan_16S_R_9.2.ccs99.9.fastq.gz     21759     21411
## Callahan_16S_R_9.3.ccs99.9.fastq.gz     18701     18429
## Callahan_16S_R_9.4.ccs99.9.fastq.gz     13481     13258

We are losing <2% of reads per sample. The mean and median reads per sample are over 20,000, with no low-depth outliers.
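
Those numbers come straight from the tracking matrix returned by filterAndTrim, and can be recomputed as an optional check:

# Per-sample fraction of reads removed by the filter, and a summary of output depths
round(1 - track2[, "reads.out"] / track2[, "reads.in"], 4)
summary(track2[, "reads.out"])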

Run DADA2

Dereplicate:

drp2 <- derepFastq(filts2, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_11.1.ccs99.9.fastq.gz
## Encountered 4468 unique sequences from 17147 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_3.1.ccs99.9.fastq.gz
## Encountered 4958 unique sequences from 26190 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_3.2.ccs99.9.fastq.gz
## Encountered 5326 unique sequences from 24081 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_3.3.ccs99.9.fastq.gz
## Encountered 3668 unique sequences from 16474 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.1.ccs99.9.fastq.gz
## Encountered 6385 unique sequences from 28171 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.1B.ccs99.9.fastq.gz
## Encountered 5529 unique sequences from 24453 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.2.ccs99.9.fastq.gz
## Encountered 5519 unique sequences from 21411 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.3.ccs99.9.fastq.gz
## Encountered 5152 unique sequences from 18429 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.4.ccs99.9.fastq.gz
## Encountered 3772 unique sequences from 13258 total sequences read.

Roughly one quarter of the reads in these samples are unique sequences.
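
That ratio can be confirmed from the derep objects, which store per-sequence abundances in their $uniques slot (an optional check, not part of the original workflow):

# Fraction of reads that are unique sequences, per sample
sapply(drp2, function(x) length(x$uniques) / sum(x$uniques))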

Learn errors:

err2 <- learnErrors(drp2, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)
## 122486022 total bases in 83892 reads from 4 samples will be used for learning the error rates.
saveRDS(err2, file.path(path.rds, "Fecal_err2.rds"))
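
Saving the fitted error model means the slow learnErrors step does not have to be re-run in a later session; it can simply be restored:

# In a later session, reload the saved error model instead of re-learning it
err2 <- readRDS(file.path(path.rds, "Fecal_err2.rds"))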

Inspect errors:

plotErrors(err2)
## Warning: Transformation introduced infinite values in continuous y-axis
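
This warning is expected and generally harmless: plotErrors shows error frequencies on a log-scale y-axis, so error-type cells with zero observed counts become infinite and are dropped from the plot. If you want to keep the figure, one option (the filename here is just a placeholder, written to the path.out directory defined in the setup) is:

# Optional: save the error-model diagnostic plot to the Figures/ directory
ggsave(file.path(path.out, "Fecal2_errors.png"), plotErrors(err2), width=8, height=6)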