Data

The mock community being sequenced is the ZymoBIOMICS Microbial Community Standard: https://www.zymoresearch.com/zymobiomics-community-standard . This mock community contains the following bacterial and yeast species (but we’ll ignore the yeast as they aren’t amplified by this protocol):

Of note, Zymo Research replaced five strains in the ZymoBIOMICS™ standards with similar strains beginning with Lot ZRC190633. The Lot # of the sample analyzed here was ZRC187325, which contains the “old” mixture of strains. For indiscernible reasons that suggest caution in ordering microbial “standards” products from this company, Zymo Research will not share the identify of all the strains in the older products. However, they did confirm the identify of the E. coli strain in the old products as E. coli O157:H7 str. CDC B6914-MS1.

Saving the 16S copy numbers for the bacteria from the table:

ncopy <- c("Pseudomonas"=4, "Escherichia"=7, "Salmonella"=7, "Lactobacillus"=5, 
             "Enterococcus"=4, "Staphylococcus"=6, "Listeria"=6, "Bacillus"=10)

This sequencing data was generated by PacBio using their in-house 16S amplification and sequencing protocol: https://www.pacb.com/wp-content/uploads/Unsupported-Full-Length-16S-Amplification-SMRTbell-LibraryPreparation-and-Sequencing.pdf

The sequencing was performed on a Sequel, with the S/P1-C1.2 sequencing chemistry. Standard PacBio bioinformatics were performed, and consensus sequences were extracting using CCS 3.1.1 with default parameters, minPasses=3 and minPredictedAccuracy=0.999:

ccs --pbi --force --logLevel=DEBUG --numThreads=16 --minSnr=3.75 --minReadScore=0.65 --maxLength=7000 --minLength=10 --minPasses=3 --minZScore=-5 --maxDropFraction=0.34 --minPredictedAccuracy=0.999 subreads.bam ccs.bam

Setup

Load libraries, setup paths, prepare environment:

library(dada2);packageVersion("dada2")
## [1] '1.12.1'
library(Biostrings);packageVersion("Biostrings")
## [1] '2.46.0'
library(ShortRead);packageVersion("ShortRead")
## [1] '1.36.1'
library(ggplot2);packageVersion("ggplot2")
## [1] '3.1.0'
library(reshape2);packageVersion("reshape2")
## [1] '1.4.3'
library(RColorBrewer);packageVersion("RColorBrewer")
## [1] '1.1.2'
path <- "~/Desktop/LRAS/Data/Zymo/" # CHANGE ME to location of the fastq file
path.out <- "Figures/" 
path.rds <- "RDS/" 
fn <- file.path(path, "zymo_CCS_99_9.fastq.gz")
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
rc <- dada2:::rc
theme_set(theme_bw())
genusPalette <- c(Bacillus="#e41a1c", Enterococcus="#377eb8", Escherichia="#4daf4a", Lactobacillus="#984ea3",
                  Listeria="#ff7f00", Pseudomonas="#ffff33", Salmonella="#a65628", Staphylococcus="#f781bf")

Remove Primers and Filter

Remove primers and orient reads:

nop <- file.path(path, "noprimers", basename(fn))
prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, verbose=TRUE)
## Multiple matches to the primer(s) in some sequences. Using the longest possible match.
## Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match.
## 39439 sequences out of 77453 are being reverse-complemented.
## Overwriting file:/Users/bcallah/Desktop/LRAS/Data/Zymo/noprimers/zymo_CCS_99_9.fastq.gz
## Read in 77453, output 73057 (94.3%) filtered sequences.

Very little loss there even though primer indels aren’t allowed currently.

Inspect length distribution.

hist(nchar(getSequences(nop)), 100)

Sharply peaked at the expected length range.

Filter:

filt <- file.path(path, "noprimers", "filtered", basename(fn))
track <- fastqFilter(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
## Overwriting file:~/Desktop/LRAS/Data/Zymo//noprimers/filtered/zymo_CCS_99_9.fastq.gz
## Read in 73057, output 69367 (94.9%) filtered sequences.

Very little lost to filtering.

Run DADA2

Dereplicate:

drp <- derepFastq(filt, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Zymo//noprimers/filtered/zymo_CCS_99_9.fastq.gz
## Encountered 20516 unique sequences from 69367 total sequences read.

Learn errors:

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun) # 10s of seconds
## 102315005 total bases in 69367 reads from 1 samples will be used for learning the error rates.

Inspect errors:

plotErrors(err)
## Warning: Transformation introduced infinite values in continuous y-axis

Looks good.

Denoise:

dd <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE) # seconds
## Sample 1 - 69367 reads in 20516 unique sequences.
dd
## dada-class: object describing DADA2 denoising results
## 29 sequence variants were inferred from 20516 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 32

Read tracking:

cbind(ccs=prim[,1], primers=prim[,2], filtered=track[[2]], denoised=sum(dd$denoised))
##        ccs primers filtered denoised
## [1,] 77453   73057    69367    69261

Inspect clustering data.frame

dd$clustering[,-1]
##    abundance   n0  n1 nunq pval birth_type birth_pval    birth_fold
## 1       7129 6095 930 1839    0          I         NA            NA
## 2       6231 5251 906 1768    0          A          0           Inf
## 3       5433 4747 641 1391    0          A          0           Inf
## 4       5452 4688 711 1492    0          A          0           Inf
## 5       3471 3017 421  900    0          A          0  2.079674e+04
## 6       3777 3236 506 1110    0          A          0           Inf
## 7       2902 2487 383  866    0          A          0  1.825207e+08
## 8       2658 2325 306  764    0          A          0           Inf
## 9       2299 1983 286  693    0          A          0 1.511500e+173
## 10      2100 1808 273  641    0          A          0  1.893879e+09
## 11      2032 1755 252  640    0          A          0  1.115776e+03
## 12      1989 1755 219  612    0          A          0  1.571964e+08
## 13      1895 1665 215  565    0          A          0  1.540096e+10
## 14      1845 1648 177  515    0          A          0  8.434966e+07
## 15      1903 1647 232  592    0          A          0           Inf
## 16      1795 1533 239  590    0          A          0  2.935131e+14
## 17      1822 1547 236  622    0          A          0  3.791198e+09
## 18      1660 1437 201  527    0          A          0  1.760983e+05
## 19      1617 1366 233  533    0          A          0  9.952192e+04
## 20      1503 1297 185  486    0          A          0  8.651874e+07
## 21      1415 1217 179  455    0          A          0  1.485209e+09
## 22      1456 1236 202  500    0          A          0  9.326275e+04
## 23      1457 1263 182  489    0          A          0  1.041753e+09
## 24      1219 1037 168  319    0          A          0  3.043078e+02
## 25      1173  998 154  411    0          A          0  6.013010e+37
## 26       784  674 103  275    0          A          0  3.557368e+12
## 27       741  636  90  265    0          A          0  6.667358e+02
## 28       742  631  99  263    0          A          0  2.739388e+04
## 29       761  660  88  287    0          A          0  8.740252e+29
##    birth_ham birth_qave
## 1         NA         NA
## 2        266   90.51128
## 3        279   90.42294
## 4        272   90.49632
## 5          1   92.00000
## 6        259   90.56757
## 7          2   83.50000
## 8        178   90.83708
## 9         35   90.82857
## 10         2   87.50000
## 11         1   93.00000
## 12         2   92.00000
## 13         2   91.50000
## 14         2   88.00000
## 15       288   90.14236
## 16         3   87.00000
## 17         2   91.50000
## 18         1   92.00000
## 19         1   92.00000
## 20         2   92.50000
## 21         2   85.50000
## 22         1   92.00000
## 23         2   88.50000
## 24         1   73.00000
## 25         8   86.50000
## 26         3   92.00000
## 27         1   93.00000
## 28         1   92.00000
## 29         6   89.50000

Very strong differentiation of true sequences and errors based on the classification p-values reported.

Taxonomy and Chimeras

Assign taxonomy:

tax <- assignTaxonomy(dd, "~/tax/silva_nr_v128_train_set.fa.gz", multithread=TRUE)
tax[,"Genus"] <- gsub("Escherichia/Shigella", "Escherichia", tax[,"Genus"]) # Reformat to be compatible with other data sources
unname(tax)
##       [,1]       [,2]             [,3]                 
##  [1,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
##  [2,] "Bacteria" "Firmicutes"     "Bacilli"            
##  [3,] "Bacteria" "Firmicutes"     "Bacilli"            
##  [4,] "Bacteria" "Firmicutes"     "Bacilli"            
##  [5,] "Bacteria" "Firmicutes"     "Bacilli"            
##  [6,] "Bacteria" "Firmicutes"     "Bacilli"            
##  [7,] "Bacteria" "Firmicutes"     "Bacilli"            
##  [8,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
##  [9,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [10,] "Bacteria" "Firmicutes"     "Bacilli"            
## [11,] "Bacteria" "Firmicutes"     "Bacilli"            
## [12,] "Bacteria" "Firmicutes"     "Bacilli"            
## [13,] "Bacteria" "Firmicutes"     "Bacilli"            
## [14,] "Bacteria" "Firmicutes"     "Bacilli"            
## [15,] "Bacteria" "Firmicutes"     "Bacilli"            
## [16,] "Bacteria" "Firmicutes"     "Bacilli"            
## [17,] "Bacteria" "Firmicutes"     "Bacilli"            
## [18,] "Bacteria" "Firmicutes"     "Bacilli"            
## [19,] "Bacteria" "Firmicutes"     "Bacilli"            
## [20,] "Bacteria" "Firmicutes"     "Bacilli"            
## [21,] "Bacteria" "Firmicutes"     "Bacilli"            
## [22,] "Bacteria" "Firmicutes"     "Bacilli"            
## [23,] "Bacteria" "Firmicutes"     "Bacilli"            
## [24,] "Bacteria" "Firmicutes"     "Bacilli"            
## [25,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [26,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [27,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [28,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [29,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
##       [,4]                [,5]                 [,6]            
##  [1,] "Enterobacteriales" "Enterobacteriaceae" "Salmonella"    
##  [2,] "Bacillales"        "Bacillaceae"        "Bacillus"      
##  [3,] "Bacillales"        "Staphylococcaceae"  "Staphylococcus"
##  [4,] "Bacillales"        "Listeriaceae"       "Listeria"      
##  [5,] "Bacillales"        "Listeriaceae"       "Listeria"      
##  [6,] "Lactobacillales"   "Enterococcaceae"    "Enterococcus"  
##  [7,] "Bacillales"        "Bacillaceae"        "Bacillus"      
##  [8,] "Pseudomonadales"   "Pseudomonadaceae"   "Pseudomonas"   
##  [9,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"   
## [10,] "Lactobacillales"   "Enterococcaceae"    "Enterococcus"  
## [11,] "Lactobacillales"   "Enterococcaceae"    "Enterococcus"  
## [12,] "Bacillales"        "Staphylococcaceae"  "Staphylococcus"
## [13,] "Bacillales"        "Staphylococcaceae"  "Staphylococcus"
## [14,] "Bacillales"        "Staphylococcaceae"  "Staphylococcus"
## [15,] "Lactobacillales"   "Lactobacillaceae"   "Lactobacillus" 
## [16,] "Lactobacillales"   "Lactobacillaceae"   "Lactobacillus" 
## [17,] "Lactobacillales"   "Lactobacillaceae"   "Lactobacillus" 
## [18,] "Bacillales"        "Listeriaceae"       "Listeria"      
## [19,] "Bacillales"        "Bacillaceae"        "Bacillus"      
## [20,] "Lactobacillales"   "Lactobacillaceae"   "Lactobacillus" 
## [21,] "Bacillales"        "Bacillaceae"        "Bacillus"      
## [22,] "Bacillales"        "Bacillaceae"        "Bacillus"      
## [23,] "Lactobacillales"   "Lactobacillaceae"   "Lactobacillus" 
## [24,] "Bacillales"        "Bacillaceae"        "Bacillus"      
## [25,] "Enterobacteriales" "Enterobacteriaceae" "Salmonella"    
## [26,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"   
## [27,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"   
## [28,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"   
## [29,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"

No yeasts, presumably lost due to the bacterial primers, but all the expected bacterial genera are there.

Check chimeras:

bim <- isBimeraDenovo(dd, minFoldParentOverAbundance=3.5, multithread=TRUE) 
# Higher MFPOA to avoid flagging intra-genomic variants
table(bim)
## bim
## FALSE 
##    29

No chimeras!

Save processed objects for future analysis.

saveRDS(dd, file.path(path.rds, "Zymo_dd.rds"))
saveRDS(tax, file.path(path.rds, "Zymo_tax_Silva128.rds"))

Reload processed data objects (can run code below from here in absence of input sequence data):

dd <- readRDS(file.path(path.rds, "Zymo_dd.rds"))
tax <- readRDS(file.path(path.rds, "Zymo_tax_Silva128.rds"))

Reference check againt nt

BLAST these 29 sequences against nt (https://blast.ncbi.nlm.nih.gov/Blast.cgi):

## dada2:::pfasta(dd, id=tax[,6])
uniquesToFasta(dd, file.path(path.rds, "Zymo_ASVs.fa"), ids=paste0(tax[,"Genus"], ";size=", dd$denoised, ";"))
is.ec <- tax[,"Genus"] %in% "Escherichia"
uniquesToFasta(dd$clustering[is.ec,], file.path(path.rds, "Zymo_Ecoli.fa"), ids=paste0("Ecoli", seq(sum(is.ec)), "_Zymo"))
## BLAST against nt excluding uncultured/environmental sequences (checkbox)

Results of BLAST search on July 18, 2018 are recorded.

  1. Exact match (100% identity, 100% coverage) to S. enterica strains: 2014LSAL02547, 08-00436, SA20035215, C500, CFSAN002050, NBRC 105684, JCM 1651
  2. Exact match to (~10) B. subtilis/xiamenensis/instestinalis strains, including NRS 231 and ATCC 6633
  3. Exact match to many (>100) S. aureus strains.
  4. Exact match to many (>100) L. monocytogenes strains.
  5. Exact match to many (~40) L. monocytogenes strains.
  6. Exact match to many (~50) E. faecalis strains.
  7. Exact match to B. subtilis/intestinalis strains: T30, NRS 231, and W23
  8. Exact match to many (>100) Pseudomonas aeruginosa strains.
  9. Exact match to many (~50) E. coli strains.
  10. Exact match to E. faecalis strains: KB1, D32, SE-8, HN-S5
  11. Exact match to E. faecalis strain: NRIC 0110
  12. Exact match to S. aureus strains: NCTC9944, MRSA107, TOHH628, V605, V521, Gv88, TW20
  13. One mismatch (1st base) to many S. aureus strains.
  14. Exact match to many S. aureus strains (99).
  15. One mismatch to L. fermentum strains: 3872, FTDC 8312, HBUAS54017
  16. Exact match to L. fermentum strains: CBA7106, FTDC 8312, 3872
  17. Exact match to many (~20) L. fermentum strains.
  18. Exact match to many (~15) L. monocytogenes strains.
  19. Exact match to B. subtilis/intestinalis strains: T30, NRS 231, W23
  20. One mismatch to many (~25) L. fermentum strains.
  21. Exact match to B. subtilis/intestinalis strains: T30, NRS 231, W23
  22. One mismatch to ~10 B. subtilis/xiamenensis/instestinalis strains, including NRS 231 and ATCC 6633.
  23. One mismatch AND one 10nt insertion to Lactobacillus fermentum 3872
  24. Exact match to B. subtilis strains: NRS 231, W23
  25. Two mismatches to many (~20) S. enterica strains.
  26. Exact match to (~10) E. coli strains.
  27. Exact match to many (>100) E. coli strains.
  28. Exact match to many (~20) E. coli strains.
  29. Exact match to (~15) E. coli strains.

Everything is an exact match except for: 13, 15, 20, 22, 23, 25. 13 (Staphylococcus), 15/20/23 (Lactobacillus), 22 (Bacillus) and 25 (Salmonella).

Also 11 (Enterococcus) matches just 1 incomplete strain in NCBI nr/nt. The rest all match multiple.

Store these results in R format:

refhit.nt <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
               TRUE, TRUE, FALSE,TRUE, FALSE,TRUE, TRUE, TRUE, TRUE, FALSE,
               TRUE, FALSE,FALSE,TRUE, FALSE,TRUE, TRUE, TRUE, TRUE)
which(!refhit.nt)
## [1] 13 15 20 22 23 25

Correctly recorded.

Reference check against provided 16S references

Zymo provides 16S reference sequences for these strains. They can be downloaded from the manufacturer’s website. However, these references are sadly lacking. Remember, this is “Version 1” of the Microbial standard. However, the version 1 16S reference sequences are totally incomplete, typically containing just a single 16S sequence per organism. The “Version 2” references contain a full complement of 16S sequences for each strain, but 5/8 strains are wrong, including E. coli! Our best option is to use Version 2, and then combine with the nt database hits, and simply be honest that what we are describing are (mostly exact) matches to previously observed sequences from these species, rather than matches to an authoritative references which does not exist.

Import the 16S references and compare the recovered ASVs:

path.16S <- file.path(path, "RefSeqs/ZymoBIOMICS.STD.refseq.v2/ssrRNAs") ### Version 2, 5/8 strains are wrong
### path.16S <- file.path(path, "RefSeqs/BioPool_genomes/16S-18S") #### Version 1, just 1 allele per strain
fn16S <- list.files(path.16S, pattern=".fasta$")
names(fn16S) <- sapply(strsplit(fn16S, "_"), `[`, 1)
fn16S <- fn16S[names(ncopy)] # Drop yeast
ncopy.ref <- sapply(file.path(path.16S, fn16S), function(x) length(getSequences(x)))
names(ncopy.ref) <- names(fn16S)
ncopy.ref
##    Pseudomonas    Escherichia     Salmonella  Lactobacillus   Enterococcus 
##              4              7              7              5              4 
## Staphylococcus       Listeria       Bacillus 
##              6              6             10

All 8 strains are present in the reference files, and have the expected number of 16S sequences.

Compare the sequences to all the reference sequences, allowing no mismatches (but with fixed=FALSE so as to match ambiguous nucleotides in the references):

sq <- dd$sequence
ref.16S <- DNAStringSet(unlist(sapply(file.path(path.16S, fn16S), getSequences)))
names(ref.16S) <- sapply(strsplit(names(ref.16S), "[.]fasta[.]"), `[`, 2)
names(ref.16S) <- gsub("\\sConcatenation of 2 sequences", "", names(ref.16S))
refmat.16S <- sapply(DNAStringSet(sq), vcountPattern, subject=ref.16S, fixed=TRUE)
rownames(refmat.16S) <- names(ref.16S)
colnames(refmat.16S) <- paste0("Seq", seq_along(sq))

Inspect:

refhit.16S <- colSums(refmat.16S)>0
which(refhit.16S)
##  Seq1  Seq2  Seq4  Seq5  Seq6  Seq7  Seq8 Seq14 Seq17 Seq18 Seq19 Seq21 
##     1     2     4     5     6     7     8    14    17    18    19    21 
## Seq22 Seq24 Seq25 
##    22    24    25
table(tax[,"Genus"], Reference.Match=refhit.16S)
##                 Reference.Match
##                  FALSE TRUE
##   Bacillus           0    6
##   Enterococcus       2    1
##   Escherichia        5    0
##   Lactobacillus      4    1
##   Listeria           0    3
##   Pseudomonas        0    1
##   Salmonella         0    2
##   Staphylococcus     3    1

They mostly match the 16S references provided, although some don’t at all (e.g. E. coli, which we know is a different strain in our version of the standard than in these references sequences). Note though that nt non-hits Seq22 and Seq25 match here.

Consolidated reference comparisons

We’ll now consider the total comparisons to all three references sources: nr/nt, Zymo-provided 16S, and Zymo-provided genomes.

refmat <- matrix(c(refhit.nt, refhit.16S), ncol=2)
colnames(refmat) <- c("nr.nt", "zymo.16S")
rownames(refmat) <- paste0("Seq", seq_along(sq))
refmat
##       nr.nt zymo.16S
## Seq1   TRUE     TRUE
## Seq2   TRUE     TRUE
## Seq3   TRUE    FALSE
## Seq4   TRUE     TRUE
## Seq5   TRUE     TRUE
## Seq6   TRUE     TRUE
## Seq7   TRUE     TRUE
## Seq8   TRUE     TRUE
## Seq9   TRUE    FALSE
## Seq10  TRUE    FALSE
## Seq11  TRUE    FALSE
## Seq12  TRUE    FALSE
## Seq13 FALSE    FALSE
## Seq14  TRUE     TRUE
## Seq15 FALSE    FALSE
## Seq16  TRUE    FALSE
## Seq17  TRUE     TRUE
## Seq18  TRUE     TRUE
## Seq19  TRUE     TRUE
## Seq20 FALSE    FALSE
## Seq21  TRUE     TRUE
## Seq22 FALSE     TRUE
## Seq23 FALSE    FALSE
## Seq24  TRUE     TRUE
## Seq25 FALSE     TRUE
## Seq26  TRUE    FALSE
## Seq27  TRUE    FALSE
## Seq28  TRUE    FALSE
## Seq29  TRUE    FALSE

Consider whether the sequences appear, exaclty, in any of the reference sources:

refhit.any <- rowSums(refmat) > 0
refhit.any
##  Seq1  Seq2  Seq3  Seq4  Seq5  Seq6  Seq7  Seq8  Seq9 Seq10 Seq11 Seq12 
##  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 
## Seq13 Seq14 Seq15 Seq16 Seq17 Seq18 Seq19 Seq20 Seq21 Seq22 Seq23 Seq24 
## FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE 
## Seq25 Seq26 Seq27 Seq28 Seq29 
##  TRUE  TRUE  TRUE  TRUE  TRUE

Which don’t have any hits:

refhit.any[!refhit.any]
## Seq13 Seq15 Seq20 Seq23 
## FALSE FALSE FALSE FALSE
unname(tax[!refhit.any, "Genus"])
## [1] "Staphylococcus" "Lactobacillus"  "Lactobacillus"  "Lactobacillus"

All the rest have exact reference hits. The non-hits are mostly the 3/5 L. fermentum sequences, and also one S. aureus sequence variant.

Intra-genomic Stoichiometry

Now we’ll investigate the “stoichiometry” of the sequence variants from each bacterial strain, i.e. the copy number of each 16S sequence variants relative to the genomic copy number, all inferred from the dada2 processed data and the rRNA gene copy numbers given for each strain in the Zymo reference.

First calculate the genomic abundances, by summing over all sequence variants for each genus and dividing by the 16S copy number:

abund.ome <- sapply(names(ncopy), function(gen) {
  is.gen <- grepl(gen, tax[,"Genus"])
  sum(dd$denoised[is.gen])/ncopy[gen]
})
names(abund.ome) <- names(ncopy)
dfgen <- data.frame(Genus=names(ncopy), Abundance=abund.ome, stringsAsFactors = FALSE)
ggplot(data=dfgen, aes(x=Genus, y=Abundance)) + 
  geom_col(width=0.4, aes(fill=Genus)) + scale_fill_manual(values=genusPalette) +
  ylim(c(0, NA)) + geom_hline(yintercept=mean(abund.ome), linetype="dashed") +
  theme(axis.text.x=element_blank(), axis.ticks.x = element_blank()) +
  ylab("Genome Abundance")

They are supposed to be all equal. Clearly they aren’t, but all are present at reasonable frequencies (i.e. <3x abundance biases).

Now we make a similar plot for the abundance of each sequence variant:

dfasv <- data.frame(Genus=tax[,"Genus"], Abundance=dd$denoised, stringsAsFactors = FALSE)
rownames(dfasv) <- NULL
ggplot(data=dfasv, aes(x=Genus, y=Abundance)) + 
  geom_point(aes(color=Genus), shape="x", size=4) + scale_color_manual(values=genusPalette) +
  ylim(c(0, NA)) +
  theme(axis.text.x=element_blank(), axis.ticks.x = element_blank()) +
  ylab("ASV Abundance")

And now we make the “stoichiometry” figure, i.e the abundance of each ASV scaled to the genomic abundance:

dfasv$ScaledAbundance <- dfasv$Abundance/abund.ome[dfasv$Genus]
# Number the ASVs in each strain/genus
dfasv$Variant <- sapply(seq(nrow(dfasv)), function(i) sum(dfasv$Genus[1:i] == dfasv$Genus[[i]], na.rm=TRUE))

p.stoich <- ggplot(data=dfasv, aes(x=Variant, y=ScaledAbundance, fill=Genus, width=0.5)) + geom_col() + 
  scale_fill_manual(values=genusPalette) +
  facet_wrap(~Genus, nrow=2) +
  scale_y_continuous(breaks=seq(0,round(max(dfasv$ScaledAbundance))), minor_breaks=NULL) +
  theme(panel.grid.major.y=element_line(color="grey60", size=0.2)) +
  theme(panel.grid.major.x=element_blank(), panel.grid.minor.x=element_blank()) +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  xlab("Full-length 16S Sequence Variants") + 
  ylab("Abundance (per-genome)") + 
  guides(fill=FALSE)
p.stoich

ggsave(file.path(path.out, "Zymo_Stoich.pdf"), p.stoich, width=178, height=100, units="mm", useDingbats=FALSE)

That is beautiful. There is an integer number of each sequence variant in each genome, just as there must be. That is, these aren’t sequencing errors, these are real sequence variants. All of them.

Make another figure that compares the integral values to those if we change genome copy number to be some random-ish value:

set.seed(100)

dfasv$Ncopy <- dfasv$ScaledAbundance
# Function. Genus. Named vector with genome copy numbers. Assumes dd and tax in outer environment.
get.abund.ome <- function(genera, ncopies) {
  rval <- sapply(genera, function(gen) {
    is.gen <- grepl(gen, tax[,"Genus"])
    sum(dd$denoised[is.gen])/ncopies[gen]
  })
  names(rval) <- genera
  rval
}
# Permute abund.ome
foo <- sample(length(ncopy))
while(any(foo == seq(length(ncopy)))) foo <- sample(length(ncopy)) # Ensure all switch
abund.perm <- abund.ome; abund.perm[1:length(abund.perm)] <- abund.perm[foo]
dfasv$PermuteAbundance <- dfasv$Abundance/abund.perm[dfasv$Genus]
# Permute copy numbers
foo <- sample(length(ncopy))
while(any(foo == seq(length(ncopy)))) foo <- sample(length(ncopy)) # Ensure all switch
ncopy.perm <- ncopy; ncopy.perm[1:length(ncopy.perm)] <- ncopy.perm[foo]
abund.ncopy.perm <- get.abund.ome(names(ncopy), ncopy.perm)
dfasv$PermuteNcopy <- dfasv$Abundance/abund.ncopy.perm[dfasv$Genus]
# Set copy numbers to the number of unique sequence variants
nasv <- table(tax[,"Genus"])
abund.nasv <- get.abund.ome(names(ncopy), nasv)
dfasv$NasvsNcopy <- dfasv$Abundance/abund.nasv[dfasv$Genus]
# Set copy numbers to the median copy number
ncopy.median <- ncopy; ncopy.median[1:length(ncopy.median)] <- median(ncopy)
abund.ncopy.median <- get.abund.ome(names(ncopy), ncopy.median)
dfasv$MedianNcopy <- dfasv$Abundance/abund.ncopy.median[dfasv$Genus]

df4 <- melt(dfasv, measure.vars=c("Ncopy", "PermuteAbundance", "PermuteNcopy", "NasvsNcopy", "MedianNcopy"), 
            value.name="Count", variable.name="Method")

ggplot(data=df4, aes(x=Method, y=Count, color=Genus)) + geom_jitter(height=0, width=0.15) +
  theme_bw() + theme(panel.grid.major.y=element_line(color="black", size=0.25)) +
#  theme(aspect.ratio=10) + 
#  scale_x_continuous(limits=c(0.7,1.3), breaks=NULL) +
  scale_y_continuous(breaks=seq(round(max(df4$Count))), minor_breaks=NULL)

It is only with the true copy numbers that we get the integer ladder.

PacBio Quality Diagnostics

We now consider the distribution of PacBio CCS sequencing errors, using the reconstructed ASVs as the ground truth as confirmed above.

Rerun dada correcting all reads:

dda <- dada(drp, err=err, BAND_SIZE=32, OMEGA_C=0, multi=TRUE)
## Sample 1 - 69367 reads in 20516 unique sequences.
dda
## dada-class: object describing DADA2 denoising results
## 29 sequence variants were inferred from 20516 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 0, BAND_SIZE = 32

Record all errors between the raw reads and the true ASVs by position (1-1500ish) and by type (D:Deletion, I:Insertion, S:Substitution).

library(Biostrings)
diffs <- function(query, ref, vec=TRUE, ...) { # Currently not capturing insertions
  al <- nwalign(query, ref, vec=vec, ...)
  cstr <- compareStrings(al[[1]], al[[2]])
  cstr.ref <- gsub("[+]", "", cstr) # Ref coordinates, but lose insertion info
  cstr.q <- gsub("[-]", "", cstr) # Query coordinates, but lose deletion info
  cstr.ins <- gsub("[+]+", "+", cstr) # Reduce all multi-base inserts to a single insert
  cstr.del <- gsub("[-]+", "-", cstr) # Reduce all multi-base deletions to a single deletion
  refpos.sub <- unlist(gregexpr(pattern='[?]', cstr.ref))
  refpos.del <- unlist(gregexpr(pattern='[-]+', cstr.ref)) # Just getting loc of first deleted base of multi-nt deletions
  refpos.ins <- unlist(gregexpr(pattern='[+]', cstr.ins))
  refpos.ins <- refpos.ins - seq_along(refpos.ins) + 1 # Correct back to ref coords
  qpos.sub <- unlist(gregexpr(pattern='[?]', cstr.q))
  qpos.ins <- unlist(gregexpr(pattern='[+]+', cstr.q)) # Just getting loc of first inserted base of multi-nt inserts
  qpos.del <- unlist(gregexpr(pattern='[-]', cstr.del))
  qpos.del <- qpos.del - seq_along(qpos.del) + 1 # Correct back to ref coords
  rv <- rbind( data.frame(Type="S", RefPos=refpos.sub, QueryPos=qpos.sub),
               data.frame(Type="D", RefPos=refpos.del, QueryPos=qpos.del),
               data.frame(Type="I", RefPos=refpos.ins, QueryPos=qpos.ins))
  rv[rv$RefPos > -1,]
}

df.diffs <- function(i) {
  df <- diffs(getSequences(drp)[i], dd$sequence[dd$map[i]])
  df <- cbind(df, Abund = rep(drp$uniques[i], nrow(df)), 
              Derep=rep(i, nrow(df)), Denoised=rep(dd$map[i], nrow(df)))
  df$Qual <- drp$quals[cbind(df$Derep, df$QueryPos)]
  df
}

Identify chimeras in the denoised reads:

bim.drp <- isBimeraDenovo(drp, minFoldParentOverAbundance=3.5, multi=TRUE) # A couple minutes
table(bim.drp)
## bim.drp
## FALSE  TRUE 
## 18629  1887
unname(drp$uniques[unname(head(which(bim.drp)))])
## [1] 51 47 40 24 23 21

Identify which uniques were corrected/uncorrected:

corrected.drp <- !is.na(dd$map)
table(corrected.drp)
## corrected.drp
## FALSE  TRUE 
##   106 20410

Identify contaminants as those reads that are not assigned to one of the genera in the mock community.

tax.drp.uncorrected <- assignTaxonomy(getSequences(drp)[!corrected.drp], "~/tax/silva_nr_v128_train_set.fa.gz", multi=TRUE) # 
table(tax.drp.uncorrected[,"Genus"], useNA="ifany")
## 
##       Azorhizophilus             Bacillus         Enterococcus 
##                    1                   24                    6 
## Escherichia/Shigella              Gemmata        Lactobacillus 
##                    4                    1                   17 
##             Listeria    Pseudocitrobacter          Pseudomonas 
##                   16                    2                    1 
##           Salmonella       Staphylococcus         Trabulsiella 
##                    6                   14                    1 
##                 <NA> 
##                   13
table(bim.drp[!corrected.drp][is.na(tax.drp.uncorrected[,"Genus"])], useNA="ifany")
## 
## FALSE  TRUE 
##     5     8
unname(tax.drp.uncorrected[!bim.drp[!corrected.drp] & is.na(tax.drp.uncorrected[,"Genus"]),])
##      [,1]       [,2]             [,3]                  [,4]               
## [1,] "Bacteria" NA               NA                    NA                 
## [2,] "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales"
## [3,] "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales"
## [4,] "Bacteria" "Firmicutes"     "Bacilli"             "Bacillales"       
## [5,] "Bacteria" "Firmicutes"     "Bacilli"             "Lactobacillales"  
##      [,5]                 [,6]
## [1,] NA                   NA  
## [2,] "Enterobacteriaceae" NA  
## [3,] "Enterobacteriaceae" NA  
## [4,] "Bacillaceae"        NA  
## [5,] "Enterococcaceae"    NA

Construct a data.frame of all errors between non-contaminant/non-chimeric reads and the true ASVs.

gens <- unique(tax[,"Genus"])
keep <- !is.na(dd$map) # Keep all corrected at default settings
keep[!keep] <- tax.drp.uncorrected[,"Genus"] %in% gens # Also keep uncorrected if in the expected genera
keep[bim.drp] <- FALSE # Don't keep chimeras
ii <- which(keep) # Index version of keep
diffs.keep <- lapply(ii, df.diffs) # ~2 mins
bar <- do.call(rbind, diffs.keep)
bar <- bar[order(bar$Qual, decreasing=TRUE),]

Count up the number of reads that were kept that extend to each nucleotide position.

nnt.keep <- sapply(seq(max(nchar(getSequences(drp)[keep]))), function(pos) {
  sum(drp$uniques[keep & nchar(getSequences(drp)) >= pos])
})

Barplots with quality colored as a function of query sequence position:

pbar <- bar
pbar <- pbar[!is.na(pbar$Qual),]
pbar$Quality <- cut(pbar$Qual, c(0, 50, 70, 90, 93))
qual.map <- c("(0-50]"="0-50", "(51,70]"="51-70", "(70,90]"="71-90", "(70,90]"="91+")
pbar$Quality <- qual.map[pbar$Quality]
pbar$Quality[pbar$Type=="D"] <- "NA"
pbar$Quality <- factor(pbar$Quality, levels=c(qual.map, "NA"))
type.map <- c("S"="Substitution", "D"="Deletion", "I"="Insertion")
pbar$Error <- factor(type.map[pbar$Type], levels=c("Substitution", "Insertion", "Deletion"))
color.scale <- c("hotpink", colorRampPalette(c("deeppink2", "dodgerblue2"))(3))
color.scale <- c(colorRampPalette(c("black", "cyan"))(4), "grey")
# Convert counts to rates by using the total lengths of all reads
pbar$Rate <- pbar$Abund/nnt.keep[pbar$QueryPos]
# Force desired facet ymax limits with a custom data.frame, and set desired breaks
dflim <- data.frame(Rate=c(0.001, 0.01, 0.003), 
                    QueryPos=c(200, 200, 200), 
                    Quality=c("NA", "NA", "NA"),
                    Error=c("Substitution", "Insertion", "Deletion"))
my_breaks <- function(x) { 
  if (max(x) < 0.0015) { c("0.0000"=0, "0.0010"=0.001) }  # "0.0005"=0.0005, 
  else if(max(x) < 0.005) { c("0.0000"=0, "0.0010"=0.001, "0.0020"=0.002, "0.0030"=0.003) }
#  else { c("0.0000"=0, "0.0050"=0.005, "0.0100"=0.01) }
  else { c("0.0000"=0, "0.0020"=0.002, "0.0040"=0.004, "0.0060"=0.006, "0.0080"=0.008, "0.0100"=0.01) }
}
p.err.pos <- ggplot(data=pbar, aes(x=QueryPos,y=Rate,color=NULL, fill=Quality)) + geom_col() +
  facet_grid(Error~., scales="free_y") + guides(color=FALSE) + xlab("Nucleotide Position") + ylab("Error Rate") +
  scale_color_manual(values=color.scale) + scale_fill_manual(values=color.scale) +
  geom_blank(data=dflim) + scale_y_continuous(breaks=my_breaks) + theme(axis.text.y=element_text(size=7))
p.err.pos

ggsave(file.path(path.out, "Zymo_errpos.pdf"), p.err.pos, width=178, height=100, units="mm", useDingbats=FALSE)
p.err.pos.200 <- ggplot(data=pbar[pbar$QueryPos %in% seq(151,350),], aes(x=QueryPos,y=Rate, color=NULL, fill=Quality)) + geom_col() +
  facet_grid(Error~., scales="free_y") + guides(color=FALSE) + xlab("Nucleotide Position") + ylab("Error Rate") +
  scale_color_manual(values=color.scale) + scale_fill_manual(values=color.scale) +
  geom_blank(data=dflim) + scale_y_continuous(breaks=my_breaks) + theme(axis.text.y=element_text(size=7))
p.err.pos.200

ggsave(file.path(path.out, "Zymo_errpos_200.pdf"), p.err.pos.200, width=178, height=78, units="mm", useDingbats=FALSE)
ggsave(file.path(path.out, "Zymo_errpos_200.png"), p.err.pos.200, dpi=300)
## Saving 7 x 5 in image
#  scale_color_gradientn(colors=rainbow(5)) + facet_wrap(~Type)
# In final polishing, used Illustrator toadd the intermediate ticks to the Insertion facet as well
# so the ticks are the same scale (0.001) in each plot

That is pretty cool. The two big peaks in substitutionss are strongly associated with lower quality scores, and quality is very informative about insertions! Quality doesn’t work for deletions of course because deletions as they aren’t in the query sequence, and hence don’t really have an associated quality score.

Calculate aggregate error probabilities for total errors and each type individually:

sq.drp <- getSequences(drp)
totbases.keep <- sum( (nchar(sq.drp)*drp$uniques)[keep] )
tapply(bar$Abund, bar$Type, sum); sum(bar$Abund)
##     S     D     I 
## 10423 10117 21558
## [1] 42098
tapply(bar$Abund, bar$Type, sum)/totbases.keep; sum(bar$Abund)/totbases.keep
##            S            D            I 
## 0.0001074245 0.0001042707 0.0002221871
## [1] 0.0004338822

Plot error rates versus quality scores for subsitutions and insertions.

srq <- readFastq(filt)
qmat <- as(quality(srq), "matrix") # Raw quality scores
totqs.keep <- tabulate(as.vector(qmat[keep,])) # entries 1-93 for the Q scores
imat.subs <- cbind(drpi = unlist(sapply(bar$Derep[bar$Type=="S"], function(i) which(drp$map==i))),
                   pos = rep(bar$QueryPos[bar$Type=="S"], times=bar$Abund[bar$Type=="S"]))
subqs.keep <- tabulate(as.vector(qmat[imat.subs]))
imat.ins <- cbind(drpi = unlist(sapply(bar$Derep[bar$Type=="I"], function(i) which(drp$map==i))),
                  pos = rep(bar$QueryPos[bar$Type=="I"], times=bar$Abund[bar$Type=="I"]))
insqs.keep <- tabulate(as.vector(qmat[imat.ins]))
df.qplot <- data.frame(Quality=rep(seq(93),2), 
                       Count=c(subqs.keep, insqs.keep), 
                       Rate=c(subqs.keep/totqs.keep, insqs.keep/totqs.keep),
                       Type=rep(c("Substitution", "Insertion"), each=93))
ggplot(data=df.qplot, aes(x=Quality, y=Rate, color=Type)) + 
  geom_hline(aes(yintercept=sum(bar$Abund[bar$Type=="D"])/totbases.keep), color="grey", linetype="dashed") +
  geom_point() + 
  scale_y_log10(breaks=c("0.0001"=0.0001, "0.001"=0.001, "0.01"=0.01, "0.1"=0.1), name="Error Rate") + 
  scale_color_manual(values=c(Insertion="maroon", Substitution="orange"), name="Error Type") +
  scale_x_continuous(breaks=c(0, 20, 40, 60, 80, 93), limits=c(0, 93), name="Quality Score") +
  theme_bw()
## Warning: Removed 6 rows containing missing values (geom_point).

The saturation of substitution errors at high Q are probably explained by PCR. The complementation symmetry especially at high Q values in the plotErrors further suggests PCR origin for most of these errors.

plotErrors(err)
## Warning: Transformation introduced infinite values in continuous y-axis

Some summary statistics and plots on the distribution of quality scores:

totqs.keep[[93]]/sum(totqs.keep) # Fraction with max q=93 quality score
## [1] 0.8448514
ggplot(data=data.frame(Quality=seq(93), Count=totqs.keep), aes(x=Quality, y=Count)) +
  geom_step() + theme_bw() + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis

ggplot(data=data.frame(Quality=seq(93), Cumulative=cumsum(totqs.keep)/sum(totqs.keep)), aes(x=Quality, y=Cumulative)) +
  geom_step() + theme_bw()

Also calculate the number of CCS reads that were error-free at various stages.

n.error.free <- sum(drp$uniques[dd$sequence])
cbind(ccs=prim[,1], primers=prim[,2], filtered=track[[2]], denoised=sum(dd$denoised), error.free=n.error.free)
##        ccs primers filtered denoised error.free
## [1,] 77453   73057    69367    69261      40459
n.error.free/prim[[1]] # Fraction error-free out of all CCS reads
## [1] 0.5223684
n.error.free/track[[1]] # Fraction error-free out of CCS reads with detected primers, but before filtering
## [1] 0.5538005