This data is from a set of 12 fecal samples from 6 subjects. 2 of the subjects gave multiple longitudinal samples at 1-6 month intervals.
The data were processed from extracted gDNA by PacBio using their 16S protocol and the Sequel sequencing instrument. Replicate 1 was sequenced using the S/P2-C2/5.0 sequencing chemistry, and Replicate 2 was sequenced with a pre-release version of the S/P3-C3/5.0 sequencing chemistry.
CCS reads were constructed using PacBio’s software, default parameters (including minPasses
of 3) and a minPredictedAccuracy
of 99.9%.
Load libraries, setup paths, prepare environment:
library(dada2);packageVersion("dada2")
## [1] '1.12.1'
library(Biostrings); packageVersion("Biostrings")
## [1] '2.46.0'
library(ShortRead); packageVersion("ShortRead")
## [1] '1.36.1'
library(ggplot2); packageVersion("ggplot2")
## [1] '3.1.0'
library(reshape2); packageVersion("reshape2")
## [1] '1.4.3'
library(gridExtra); packageVersion("gridExtra")
## [1] '2.3'
library(phyloseq); packageVersion("phyloseq")
## [1] '1.25.2'
path1 <- "~/Desktop/LRAS/Data/Fecal1" # CHANGE ME to location of the First Replicate fastq files
path2 <- "~/Desktop/LRAS/Data/Fecal2" # CHANGE ME to location of the Second Replicate fastq files
path.out <- "Figures/"
path.rds <- "RDS/"
fns1 <- list.files(path1, pattern="fastq.gz", full.names=TRUE)
fns2 <- list.files(path2, pattern="fastq.gz", full.names=TRUE)
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
rc <- dada2:::rc
theme_set(theme_bw())
We will start by processing replicate 2, the high depth replicate, which we will use to explore the E. coli ASVs in this data.
Remove primers and orient reads:
nops2 <- file.path(path2, "noprimers", basename(fns2))
prim2 <- removePrimers(fns2, nops2, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE)
Higher loss here than in the mock communities, but still very reasonable. Probably not surprising its a bit higher given this is a real sample rather than mock DNA, and the DNA extraction step was done by CoreBiome without special care to ensuring higher molecular weight DNA.
Inspect length distribution.
lens.fn <- lapply(nops2, function(fn) nchar(getSequences(fn)))
lens <- do.call(c, lens.fn)
hist(lens, 100)
Very strong peak ~1450, looks good.
Filter:
filts2 <- file.path(path2, "noprimers", "filtered", basename(fns2))
track2 <- filterAndTrim(nops2, filts2, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2)
track2
## reads.in reads.out
## Callahan_16S_R_11.1.ccs99.9.fastq.gz 17436 17147
## Callahan_16S_R_3.1.ccs99.9.fastq.gz 26605 26190
## Callahan_16S_R_3.2.ccs99.9.fastq.gz 24477 24081
## Callahan_16S_R_3.3.ccs99.9.fastq.gz 16738 16474
## Callahan_16S_R_9.1.ccs99.9.fastq.gz 28623 28171
## Callahan_16S_R_9.1B.ccs99.9.fastq.gz 24832 24453
## Callahan_16S_R_9.2.ccs99.9.fastq.gz 21759 21411
## Callahan_16S_R_9.3.ccs99.9.fastq.gz 18701 18429
## Callahan_16S_R_9.4.ccs99.9.fastq.gz 13481 13258
Losing <2% of reads per sample. Mean/Median reads per sample is over 20K! No low depth outliers.
Dereplicate:
drp2 <- derepFastq(filts2, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_11.1.ccs99.9.fastq.gz
## Encountered 4468 unique sequences from 17147 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_3.1.ccs99.9.fastq.gz
## Encountered 4958 unique sequences from 26190 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_3.2.ccs99.9.fastq.gz
## Encountered 5326 unique sequences from 24081 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_3.3.ccs99.9.fastq.gz
## Encountered 3668 unique sequences from 16474 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.1.ccs99.9.fastq.gz
## Encountered 6385 unique sequences from 28171 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.1B.ccs99.9.fastq.gz
## Encountered 5529 unique sequences from 24453 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.2.ccs99.9.fastq.gz
## Encountered 5519 unique sequences from 21411 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.3.ccs99.9.fastq.gz
## Encountered 5152 unique sequences from 18429 total sequences read.
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Fecal2/noprimers/filtered/Callahan_16S_R_9.4.ccs99.9.fastq.gz
## Encountered 3772 unique sequences from 13258 total sequences read.
Roughly 1/4th unique sequences in these samples.
Learn errors:
err2 <- learnErrors(drp2, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)
## 122486022 total bases in 83892 reads from 4 samples will be used for learning the error rates.
saveRDS(err2, file.path(path.rds, "Fecal_err2.rds"))
Inspect errors:
plotErrors(err2)
## Warning: Transformation introduced infinite values in continuous y-axis