Data

The mock community being sequenced is the ZymoBIOMICS Microbial Community Standard: https://www.zymoresearch.com/zymobiomics-community-standard . This mock community contains the following bacterial and yeast species (but we’ll ignore the yeast as they aren’t amplified by this protocol):

Of note, Zymo Research replaced five strains in the ZymoBIOMICS™ standards with similar strains beginning with Lot ZRC190633. The Lot # of the sample analyzed here was ZRC187325, which contains the “old” mixture of strains. For indiscernible reasons that suggest caution in ordering microbial “standards” products from this company, Zymo Research will not share the identify of all the strains in the older products. However, they did confirm the identify of the E. coli strain in the old products as E. coli O157:H7 str. CDC B6914-MS1.

Saving the 16S copy numbers for the bacteria from the table:

ncopy <- c("Pseudomonas"=4, "Escherichia"=7, "Salmonella"=7, "Lactobacillus"=5, 
             "Enterococcus"=4, "Staphylococcus"=6, "Listeria"=6, "Bacillus"=10)

This sequencing data was generated by PacBio using their in-house 16S amplification and sequencing protocol: https://www.pacb.com/wp-content/uploads/Unsupported-Full-Length-16S-Amplification-SMRTbell-LibraryPreparation-and-Sequencing.pdf

The sequencing was performed on a Sequel, with the S/P1-C1.2 sequencing chemistry. Standard PacBio bioinformatics were performed, and consensus sequences were extracting using CCS 3.1.1 with default parameters, minPasses=3 and minPredictedAccuracy=0.999:

ccs --pbi --force --logLevel=DEBUG --numThreads=16 --minSnr=3.75 --minReadScore=0.65 --maxLength=7000 --minLength=10 --minPasses=3 --minZScore=-5 --maxDropFraction=0.34 --minPredictedAccuracy=0.999 subreads.bam ccs.bam

Setup

Load libraries, setup paths, prepare environment:

library(dada2);packageVersion("dada2")
## [1] '1.12.1'
library(Biostrings);packageVersion("Biostrings")
## [1] '2.46.0'
library(ShortRead);packageVersion("ShortRead")
## [1] '1.36.1'
library(ggplot2);packageVersion("ggplot2")
## [1] '3.1.0'
library(reshape2);packageVersion("reshape2")
## [1] '1.4.3'
library(RColorBrewer);packageVersion("RColorBrewer")
## [1] '1.1.2'
path <- "~/Desktop/LRAS/Data/Zymo/" # CHANGE ME to location of the fastq file
path.out <- "Figures/" 
path.rds <- "RDS/" 
fn <- file.path(path, "zymo_CCS_99_9.fastq.gz")
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
rc <- dada2:::rc
theme_set(theme_bw())
genusPalette <- c(Bacillus="#e41a1c", Enterococcus="#377eb8", Escherichia="#4daf4a", Lactobacillus="#984ea3",
                  Listeria="#ff7f00", Pseudomonas="#ffff33", Salmonella="#a65628", Staphylococcus="#f781bf")

Remove Primers and Filter

Remove primers and orient reads:

nop <- file.path(path, "noprimers", basename(fn))
prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, verbose=TRUE)
## Multiple matches to the primer(s) in some sequences. Using the longest possible match.
## Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match.
## 39439 sequences out of 77453 are being reverse-complemented.
## Overwriting file:/Users/bcallah/Desktop/LRAS/Data/Zymo/noprimers/zymo_CCS_99_9.fastq.gz
## Read in 77453, output 73057 (94.3%) filtered sequences.

Very little loss there even though primer indels aren’t allowed currently.

Inspect length distribution.

hist(nchar(getSequences(nop)), 100)

Sharply peaked at the expected length range.

Filter:

filt <- file.path(path, "noprimers", "filtered", basename(fn))
track <- fastqFilter(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
## Overwriting file:~/Desktop/LRAS/Data/Zymo//noprimers/filtered/zymo_CCS_99_9.fastq.gz
## Read in 73057, output 69367 (94.9%) filtered sequences.

Very little lost to filtering.

Run DADA2

Dereplicate:

drp <- derepFastq(filt, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Zymo//noprimers/filtered/zymo_CCS_99_9.fastq.gz
## Encountered 20516 unique sequences from 69367 total sequences read.

Learn errors:

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun) # 10s of seconds
## 102315005 total bases in 69367 reads from 1 samples will be used for learning the error rates.

Inspect errors:

plotErrors(err)
## Warning: Transformation introduced infinite values in continuous y-axis