The mock community being sequenced is the BEI Resources Microbial Mock Community B (Staggered, Low Concentration) community: https://www.beiresources.org/Catalog/otherProducts/HM-783D.aspx. This mock community (also known as the HMP mock community) contains the following bacterial in the following table, adapted from Table 1 in the BEI document, with alternative strain names and 16S copy numbers added based on the NCBI entry.
Organism | NCBI Reference Sequence |
---|---|
Acinetobacter baumannii, strain 5377 | NC_009085 |
Actinomyces odontolyticus, strain 1A.21 | NZ_AAYI02000000 |
Bacillus cereus, strain NRS 248 | NC_003909 |
Bacteroides vulgatus, strain ATCC® 8482 | NC_009614 |
Clostridium beijerinckii, strain NCIMB 8052 | NC_009617 |
Deinococcus radiodurans, strain R1 (smooth) | NC_001263, NC_001264 |
Enterococcus faecalis, strain OG1RF | NC_017316 |
Escherichia coli, strain K12, substrain MG1655 | NC_000913 |
Helicobacter pylori, strain 26695 | NC_000915 |
Lactobacillus gasseri, strain 63 AM | NC_008530 |
Listeria monocytogenes, strain EGDe | NC_003210 |
Neisseria meningitidis, strain MC58 | NC_003112 |
Propionibacterium acnes, strain KPA171202 | NC_006085 |
Pseudomonas aeruginosa, strain PAO1-LAC | NC_002516 |
Rhodobacter sphaeroides, strain ATH 2.4.1 | NC_007493, NC_007494 |
Staphylococcus aureus, strain TCH1516 | NC_010079 |
Staphylococcus epidermidis, FDA strain PCI 1200 | NC_004461 |
Streptococcus agalactiae, strain 2603 V/R | NC_004116 |
Streptococcus mutans, strain UA159 | NC_004350 |
Streptococcus pneumoniae, strain TIGR4 | NC_003028 |
16S copy number is shown for genomes by the NCBI using method “Best-placed reference protein set; GeneMarkS+”
Acknowledgment for publications should read “The following reagent was obtained through BEI Resources, NIAID, NIH as part of the Human Microbiome Project: Genomic DNA from Microbial Mock Community B (Staggered, Low Concentration), v5.2L, for 16S rRNA Gene Sequencing, HM-783D.”
This sequencing data was generated by PacBio using their in-house 16S amplification and sequencing protocol: https://www.pacb.com/wp-content/uploads/Unsupported-Full-Length-16S-Amplification-SMRTbell-LibraryPreparation-and-Sequencing.pdf (everying is the same as the processing for the Zymo mock community)
The sequencing was performed on a Sequel, with the S/P1-C1.2 sequencing chemistry. Standard PacBio bioinformatics were performed, and consensus sequences were extracting using CCS 3.1.1 with default parameters, minPasses=3
and minPredictedAccuracy=0.999
:
ccs --pbi --force --logLevel=DEBUG --numThreads=16 --minSnr=3.75 --minReadScore=0.65 --maxLength=7000 --minLength=10 --minPasses=3 --minZScore=-5 --maxDropFraction=0.34 --minPredictedAccuracy=0.999 subreads.bam ccs.bam
Load libraries, setup paths, prepare environment:
library(dada2);packageVersion("dada2")
## [1] '1.12.1'
library(Biostrings); packageVersion("Biostrings")
## [1] '2.46.0'
library(ggplot2); packageVersion("ggplot2")
## [1] '3.1.0'
library(reshape2); packageVersion("reshape2")
## [1] '1.4.3'
path <- "~/Desktop/LRAS/Data/BEI" # CHANGE ME to location of the fastq file
path.out <- "Figures/"
path.rds <- "RDS/"
fn <- file.path(path, "SO_BEI-stagLo-16S_CCS-99.9.fq.gz")
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
rc <- dada2:::rc
theme_set(theme_bw())
Remove primers and orient reads:
nop <- file.path(path, "noprimers", basename(fn))
prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, verbose=TRUE)
## Multiple matches to the primer(s) in some sequences. Using the longest possible match.
## Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match.
## 39947 sequences out of 78328 are being reverse-complemented.
## Overwriting file:/Users/bcallah/Desktop/LRAS/Data/BEI/noprimers/SO_BEI-stagLo-16S_CCS-99.9.fq.gz
## Read in 78328, output 73939 (94.4%) filtered sequences.
Very little loss there even though primer indels aren’t allowed currently.
Inspect length distribution.
hist(nchar(getSequences(nop)), 100)
Very strong peak ~1450. Perhaps some secondary peake nearby, should carry through further sequencing.
Filter:
filt <- file.path(path, "noprimers", "filtered", basename(fn))
track <- filterAndTrim(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
## Overwriting file:/Users/bcallah/Desktop/LRAS/Data/BEI/noprimers/filtered/SO_BEI-stagLo-16S_CCS-99.9.fq.gz
## Read in 73939, output 69963 (94.6%) filtered sequences.
Dereplicate:
drp <- derepFastq(filt, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/BEI/noprimers/filtered/SO_BEI-stagLo-16S_CCS-99.9.fq.gz
## Encountered 19873 unique sequences from 69963 total sequences read.
Learn errors:
err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun) # seconds
## 102693885 total bases in 69963 reads from 1 samples will be used for learning the error rates.
Looks good.
Inspect errors:
plotErrors(err)
## Warning: Transformation introduced infinite values in continuous y-axis
Denoise:
dd <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE)
## Sample 1 - 69963 reads in 19873 unique sequences.
dd
## dada-class: object describing DADA2 denoising results
## 51 sequence variants were inferred from 19873 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 32
Read tracking:
cbind(ccs=prim[,1], primers=prim[,2], filtered=track[,2], denoised=sum(dd$denoised))
## ccs primers filtered denoised
## [1,] 78328 73939 69963 69911
Inspect clustering data.frame
sq <- dd$sequence
dd$clustering[,-1]
## abundance n0 n1 nunq pval birth_type birth_pval
## 1 14640 12410 2048 3406 0.000000e+00 I NA
## 2 8286 7074 1097 2185 0.000000e+00 A 0.000000e+00
## 3 6713 5623 1009 1778 0.000000e+00 A 0.000000e+00
## 4 4215 3582 577 1178 0.000000e+00 A 0.000000e+00
## 5 3702 3139 518 1014 0.000000e+00 A 0.000000e+00
## 6 3687 3178 458 1051 0.000000e+00 A 0.000000e+00
## 7 3639 3144 460 1046 0.000000e+00 A 0.000000e+00
## 8 3371 2921 403 949 0.000000e+00 A 0.000000e+00
## 9 3276 2774 464 996 0.000000e+00 A 0.000000e+00
## 10 2917 2502 381 848 0.000000e+00 A 0.000000e+00
## 11 1971 1668 269 627 0.000000e+00 A 0.000000e+00
## 12 1810 1520 262 599 0.000000e+00 A 0.000000e+00
## 13 1752 1475 246 581 0.000000e+00 A 0.000000e+00
## 14 1761 1500 231 604 0.000000e+00 A 0.000000e+00
## 15 1515 1297 196 496 0.000000e+00 A 0.000000e+00
## 16 1872 1603 243 641 0.000000e+00 A 0.000000e+00
## 17 545 475 60 181 0.000000e+00 A 0.000000e+00
## 18 291 251 37 97 0.000000e+00 A 0.000000e+00
## 19 289 249 37 102 0.000000e+00 A 0.000000e+00
## 20 301 259 37 117 0.000000e+00 A 0.000000e+00
## 21 255 222 32 94 0.000000e+00 A 0.000000e+00
## 22 272 237 31 111 0.000000e+00 A 0.000000e+00
## 23 267 226 35 104 0.000000e+00 A 0.000000e+00
## 24 162 142 18 62 0.000000e+00 A 0.000000e+00
## 25 123 107 14 42 0.000000e+00 A 0.000000e+00
## 26 137 114 21 61 0.000000e+00 A 0.000000e+00
## 27 114 105 7 40 0.000000e+00 A 0.000000e+00
## 28 115 102 12 44 0.000000e+00 A 0.000000e+00
## 29 112 99 13 41 0.000000e+00 A 0.000000e+00
## 30 111 102 8 42 0.000000e+00 A 0.000000e+00
## 31 105 96 8 40 0.000000e+00 A 0.000000e+00
## 32 98 85 11 34 0.000000e+00 A 0.000000e+00
## 33 112 91 20 49 0.000000e+00 A 0.000000e+00
## 34 95 82 11 32 0.000000e+00 A 0.000000e+00
## 35 98 84 14 40 0.000000e+00 A 0.000000e+00
## 36 93 81 9 36 0.000000e+00 A 0.000000e+00
## 37 88 69 17 31 0.000000e+00 A 0.000000e+00
## 38 99 89 10 40 0.000000e+00 A 0.000000e+00
## 39 88 76 11 33 0.000000e+00 A 0.000000e+00
## 40 106 89 15 50 0.000000e+00 A 0.000000e+00
## 41 80 65 14 41 0.000000e+00 A 0.000000e+00
## 42 60 50 9 24 0.000000e+00 A 0.000000e+00
## 43 62 55 5 30 0.000000e+00 A 0.000000e+00
## 44 13 9 4 9 0.000000e+00 A 0.000000e+00
## 45 19 3 1 18 0.000000e+00 A 0.000000e+00
## 46 139 120 19 49 1.638113e-265 A 1.222673e-254
## 47 129 103 25 54 6.084991e-218 A 5.170449e-210
## 48 88 77 10 30 1.382173e-175 A 7.672142e-154
## 49 33 28 5 15 1.345078e-147 A 2.385693e-131
## 50 44 37 7 15 4.782749e-95 A 3.384763e-85
## 51 41 35 6 15 7.522708e-95 A 1.971157e-83
## birth_fold birth_ham birth_qave
## 1 NA NA NA
## 2 Inf 220 90.22273
## 3 3.700250e+14 3 91.00000
## 4 5.183341e+03 1 84.00000
## 5 Inf 227 90.62996
## 6 9.562878e+13 3 91.00000
## 7 4.172291e+11 3 84.33333
## 8 1.177350e+08 2 89.50000
## 9 Inf 298 90.40604
## 10 Inf 107 89.71028
## 11 5.350447e+44 9 88.77778
## 12 1.536471e+25 5 91.80000
## 13 4.302466e+25 5 92.00000
## 14 1.211854e+50 10 89.16667
## 15 7.737531e+04 1 90.00000
## 16 Inf 193 90.69948
## 17 8.864293e+94 20 91.10000
## 18 2.661190e+07 2 88.00000
## 19 Inf 273 90.69597
## 20 4.976432e+02 1 93.00000
## 21 Inf 179 90.73184
## 22 1.299704e+05 1 92.00000
## 23 1.146236e+03 1 93.00000
## 24 Inf 286 89.81818
## 25 Inf 229 90.85153
## 26 Inf 279 90.06093
## 27 7.463668e+22 5 92.00000
## 28 Inf 279 89.84588
## 29 3.253601e+07 2 92.50000
## 30 9.656787e+08 2 91.50000
## 31 3.923956e+15 4 92.75000
## 32 1.260826e+17 4 92.50000
## 33 1.035197e+12 3 92.66667
## 34 1.244310e+13 3 92.33333
## 35 8.634754e+16 4 90.00000
## 36 9.053939e+17 4 89.75000
## 37 4.389098e+08 2 92.50000
## 38 7.906752e+15 4 89.25000
## 39 3.535124e+18 4 90.75000
## 40 1.490579e+11 3 92.00000
## 41 Inf 191 90.65445
## 42 9.261894e+17 4 92.25000
## 43 Inf 268 89.71269
## 44 Inf 157 90.10828
## 45 Inf 365 91.32055
## 46 2.338938e+03 1 92.00000
## 47 2.099312e+03 1 92.00000
## 48 1.774947e+03 1 93.00000
## 49 6.686106e+07 2 92.00000
## 50 3.368326e+03 1 93.00000
## 51 5.003415e+03 1 93.00000
Big jump from pval 10-126 to below the threshold at OMEGA_A=10^40
.
Assign taxonomy:
tax <- assignTaxonomy(dd, "~/tax/silva_nr_v128_train_set.fa.gz", multithread=TRUE)
tax[,"Genus"] <- gsub("Escherichia/Shigella", "Escherichia", tax[,"Genus"]) # Reformat to be compatible with other data sources
tax[,"Genus"] <- gsub("Clostridium_sensu_stricto_1", "Clostridium", tax[,"Genus"]) # Reformat to be compatible with other data sources
unname(tax)
## [,1] [,2] [,3]
## [1,] "Bacteria" "Firmicutes" "Bacilli"
## [2,] "Bacteria" "Firmicutes" "Bacilli"
## [3,] "Bacteria" "Firmicutes" "Bacilli"
## [4,] "Bacteria" "Firmicutes" "Bacilli"
## [5,] "Bacteria" "Proteobacteria" "Alphaproteobacteria"
## [6,] "Bacteria" "Firmicutes" "Bacilli"
## [7,] "Bacteria" "Firmicutes" "Bacilli"
## [8,] "Bacteria" "Firmicutes" "Bacilli"
## [9,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [10,] "Bacteria" "Firmicutes" "Bacilli"
## [11,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [12,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [13,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [14,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [15,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [16,] "Bacteria" "Firmicutes" "Bacilli"
## [17,] "Bacteria" "Firmicutes" "Bacilli"
## [18,] "Bacteria" "Firmicutes" "Bacilli"
## [19,] "Bacteria" "Proteobacteria" "Epsilonproteobacteria"
## [20,] "Bacteria" "Firmicutes" "Bacilli"
## [21,] "Bacteria" "Firmicutes" "Bacilli"
## [22,] "Bacteria" "Firmicutes" "Bacilli"
## [23,] "Bacteria" "Firmicutes" "Bacilli"
## [24,] "Bacteria" "Proteobacteria" "Betaproteobacteria"
## [25,] "Bacteria" "Firmicutes" "Clostridia"
## [26,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [27,] "Bacteria" "Firmicutes" "Clostridia"
## [28,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [29,] "Bacteria" "Firmicutes" "Clostridia"
## [30,] "Bacteria" "Firmicutes" "Clostridia"
## [31,] "Bacteria" "Firmicutes" "Clostridia"
## [32,] "Bacteria" "Firmicutes" "Clostridia"
## [33,] "Bacteria" "Firmicutes" "Clostridia"
## [34,] "Bacteria" "Firmicutes" "Clostridia"
## [35,] "Bacteria" "Firmicutes" "Clostridia"
## [36,] "Bacteria" "Firmicutes" "Clostridia"
## [37,] "Bacteria" "Firmicutes" "Clostridia"
## [38,] "Bacteria" "Firmicutes" "Clostridia"
## [39,] "Bacteria" "Firmicutes" "Clostridia"
## [40,] "Bacteria" "Firmicutes" "Clostridia"
## [41,] "Bacteria" "Firmicutes" "Bacilli"
## [42,] "Bacteria" "Firmicutes" "Bacilli"
## [43,] "Bacteria" "Actinobacteria" "Actinobacteria"
## [44,] "Bacteria" "Firmicutes" "Bacilli"
## [45,] "Bacteria" "Bacteroidetes" "Bacteroidia"
## [46,] "Bacteria" "Firmicutes" "Bacilli"
## [47,] "Bacteria" "Firmicutes" "Bacilli"
## [48,] "Bacteria" "Firmicutes" "Clostridia"
## [49,] "Bacteria" "Firmicutes" "Bacilli"
## [50,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [51,] "Bacteria" "Firmicutes" "Bacilli"
## [,4] [,5] [,6]
## [1,] "Lactobacillales" "Streptococcaceae" "Streptococcus"
## [2,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [3,] "Lactobacillales" "Streptococcaceae" "Streptococcus"
## [4,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [5,] "Rhodobacterales" "Rhodobacteraceae" "Rhodobacter"
## [6,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [7,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [8,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [9,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"
## [10,] "Lactobacillales" "Streptococcaceae" "Streptococcus"
## [11,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"
## [12,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"
## [13,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"
## [14,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"
## [15,] "Enterobacteriales" "Enterobacteriaceae" "Escherichia"
## [16,] "Bacillales" "Bacillaceae" "Bacillus"
## [17,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [18,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [19,] "Campylobacterales" "Helicobacteraceae" "Helicobacter"
## [20,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [21,] "Lactobacillales" "Lactobacillaceae" "Lactobacillus"
## [22,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [23,] "Bacillales" "Staphylococcaceae" "Staphylococcus"
## [24,] "Neisseriales" "Neisseriaceae" "Neisseria"
## [25,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [26,] "Pseudomonadales" "Pseudomonadaceae" "Pseudomonas"
## [27,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [28,] "Pseudomonadales" "Moraxellaceae" "Acinetobacter"
## [29,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [30,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [31,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [32,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [33,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [34,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [35,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [36,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [37,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [38,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [39,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [40,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [41,] "Bacillales" "Listeriaceae" "Listeria"
## [42,] "Bacillales" "Listeriaceae" "Listeria"
## [43,] "Propionibacteriales" "Propionibacteriaceae" "Propionibacterium"
## [44,] "Lactobacillales" "Enterococcaceae" "Enterococcus"
## [45,] "Bacteroidales" "Bacteroidaceae" "Bacteroides"
## [46,] "Bacillales" "Bacillaceae" "Bacillus"
## [47,] "Bacillales" "Bacillaceae" "Bacillus"
## [48,] "Clostridiales" "Clostridiaceae_1" "Clostridium"
## [49,] "Bacillales" "Listeriaceae" "Listeria"
## [50,] "Pseudomonadales" "Pseudomonadaceae" "Pseudomonas"
## [51,] "Bacillales" "Listeriaceae" "Listeria"
Plot abundances:
plot(dd$denoised)
plot(dd$denoised, log="y")
Broad range of abundances over 3 orders of magnitude (10-10000).
Check chiemras:
bim <- isBimeraDenovo(dd, minFoldParentOverAbundance=3.5, multithread=TRUE)
# Higher MFPOA to avoid flagging intra-genomic variants
table(bim)
## bim
## FALSE
## 51
No chimeras!
table(tax[,"Genus"])
##
## Acinetobacter Bacillus Bacteroides Clostridium
## 1 3 1 15
## Enterococcus Escherichia Helicobacter Lactobacillus
## 1 6 1 1
## Listeria Neisseria Propionibacterium Pseudomonas
## 4 1 1 2
## Rhodobacter Staphylococcus Streptococcus
## 1 10 3
Save processed objects for future analysis.
saveRDS(dd, file.path(path.rds, "HMP_dd.rds"))
saveRDS(tax, file.path(path.rds, "HMP_tax_Silva128.rds"))
Reload processed data objects (can run code below from here in absence of input sequence data):
dd <- readRDS(file.path(path.rds, "HMP_dd.rds"))
tax <- readRDS(file.path(path.rds, "HMP_tax_Silva128.rds"))
BLAST these 51 sequences against nt (https://blast.ncbi.nlm.nih.gov/Blast.cgi):
## dada2:::pfasta(dd, id=tax[,6])
uniquesToFasta(dd, file.path(path.rds, "LRAS_HMP_ASVs.fa"), ids=paste0(tax[,"Genus"], ";size=", dd$denoised, ";"))
is.ec <- tax[,"Genus"] %in% "Escherichia"
uniquesToFasta(dd$clustering[is.ec,], file.path(path.rds, "HMP_Ecoli.fa"), ids=paste0("Ecoli", seq(sum(is.ec)), "_HMP"))
## BLAST against nt excluding uncultured/environmental sequences (checkbox)
Results of BLAST search on July 19, 2018 are recorded.
Store these results in R format:
refhit.nt <- ifelse(seq_along(sq) %in% c(7, 34, 37), FALSE, TRUE)
which(!refhit.nt)
## [1] 7 34 37
48/51 of these are exact matches to sequenced and cultured isolates, over their entire ~1.5Kb length!
Actinomyces odontolyticus, Deinococcus radiodurans and Streptococcus pneumoniae were expected members of this mock community that were not detected in the data.
Let’s deconvolute a couple strains by assigning species based on BLAST results to genera with multiple species (Staphylococcus and Streptococcus):
tax[c(1,3),"Genus"] <- "Streptococcus mutans"
tax[10,"Genus"] <- "Streptococcus agalactiae"
tax[c(2,4,6,7,8),"Genus"] <- "Staphylococcus epidermis"
tax[c(17,18,20,22,23),"Genus"] <- "Staphylococcus aureus"
BEI does not provide reference sequences for these strains. However, Robert Edgar has generated a set of references sequences as best he can based on publicy deposited genomes, and we will use that here:
fn16S <- file.path("Docs/BEI_Reference_Edgar.fa")
refs <- getSequences(fn16S)
unique(sapply(strsplit(names(refs), "\\s"), function(x) paste(x[2], x[3])))
## [1] "Streptococcus agalactiae" "Lactobacillus gasseri"
## [3] "Bacillus cereus" "Acinetobacter baumannii"
## [5] "Porphyromonas gingivalis" "Actinomyces odontolyticus"
## [7] "Streptococcus pneumoniae" "Neisseria meningitidis"
## [9] "Streptococcus mutans" "Enterococcus faecalis"
## [11] "Rhodobacter sphaeroides" "Propionibacterium acnes"
## [13] "Pseudomonas aeruginosa" "Deinococcus radiodurans"
## [15] "Bacteroides vulgatus" "Escherichia coli"
## [17] "Listeria monocytogenes" "Staphylococcus aureus"
## [19] "Clostridium beijerinckii" "Methanobrevibacter smithii"
## [21] "Staphylococcus epidermidis" "Helicobacter pylori"
Twenty-two species, whereas only twenty expected in this HMP mock. The extra taxa are Methanobrevibacter smithii and Porphyromonas gingivalis, otherwise everything matches expectations.
Name the reference sequences something useful and count the number of unique (I think) variants in each:
ref.16S <- getSequences(fn16S)
genus.spc <- sapply(strsplit(names(ref.16S), "\\s"), function(x) paste(x[2], x[3], sep="_"))
variant <- sapply(seq_along(genus.spc), function(i) sum(genus.spc[1:i] == genus.spc[[i]]))
names(ref.16S) <- paste(genus.spc, variant, sep="_")
nunq.ref <- table(genus.spc)
nunq.ref
## genus.spc
## Acinetobacter_baumannii Actinomyces_odontolyticus
## 1 1
## Bacillus_cereus Bacteroides_vulgatus
## 7 6
## Clostridium_beijerinckii Deinococcus_radiodurans
## 14 2
## Enterococcus_faecalis Escherichia_coli
## 2 6
## Helicobacter_pylori Lactobacillus_gasseri
## 2 1
## Listeria_monocytogenes Methanobrevibacter_smithii
## 4 2
## Neisseria_meningitidis Porphyromonas_gingivalis
## 1 1
## Propionibacterium_acnes Pseudomonas_aeruginosa
## 2 2
## Rhodobacter_sphaeroides Staphylococcus_aureus
## 2 4
## Staphylococcus_epidermidis Streptococcus_agalactiae
## 5 1
## Streptococcus_mutans Streptococcus_pneumoniae
## 2 1
table(tax[,"Genus"])
##
## Acinetobacter Bacillus Bacteroides
## 1 3 1
## Clostridium Enterococcus Escherichia
## 15 1 6
## Helicobacter Lactobacillus Listeria
## 1 1 4
## Neisseria Propionibacterium Pseudomonas
## 1 1 2
## Rhodobacter Staphylococcus aureus Staphylococcus epidermis
## 1 5 5
## Streptococcus agalactiae Streptococcus mutans
## 1 2
Now compare to DADA2 inferred sequences to the reference sequences, allowing no mismatches (but with fixed=FALSE
so as to match ambiguous nucleotides in the references):
sq <- dd$sequence
refmat.16S <- sapply(DNAStringSet(sq), vcountPattern, subject=DNAStringSet(ref.16S), fixed=FALSE)
rownames(refmat.16S) <- names(ref.16S)
colnames(refmat.16S) <- paste0("Seq", seq_along(sq))
Inspect:
refhit.16S <- colSums(refmat.16S)>0
which(!refhit.16S)
## Seq4 Seq6 Seq7 Seq19 Seq20 Seq33 Seq34 Seq37
## 4 6 7 19 20 33 34 37
table(tax[,"Genus"], Reference.Match=refhit.16S)
## Reference.Match
## FALSE TRUE
## Acinetobacter 0 1
## Bacillus 0 3
## Bacteroides 0 1
## Clostridium 3 12
## Enterococcus 0 1
## Escherichia 0 6
## Helicobacter 1 0
## Lactobacillus 0 1
## Listeria 0 4
## Neisseria 0 1
## Propionibacterium 0 1
## Pseudomonas 0 2
## Rhodobacter 0 1
## Staphylococcus aureus 1 4
## Staphylococcus epidermis 3 2
## Streptococcus agalactiae 0 1
## Streptococcus mutans 0 2
So 43/51 exactly match against this 16S reference based on the described strains. The 8 that don’t hit include 3/5 S. epidermis sequence variants, 1/5 S. aureus variants, 1/1 H. pylori variant, and 3/15 C. bejerenckii variants. They overlap with 1/5 S. epidermis variants, and 2/15 C. bejerenckii variants that also were not matched against nt.
Now we’ll investigate the “stoichiometry” of the sequence variants from each bacterial strain, i.e. the copy number of each 16S sequence variants relative to the genomic copy number, all inferred from the dada2 processed data and the rRNA gene copy numbers given for each strain in the Zymo reference.
These 16S copy numbers are based on (1) the NCBI reference sequence annotation if present, (2) an estimate from https://rrndb.umms.med.umich.edu/search/, and (3) inspection of the number of 16S rRNA genes in the best BLAST hits of each strain’s ASVs. If copy number was ambiguous, the best BLAST hits were chosen to break the tie, as likely to most closely capture the actual strain being interrogated here.
ncopy <- c("Acinetobacter" = 6,
"Actinomyces"=2,
"Bacillus" = 12,
"Bacteroides" = 7,
"Clostridium" = 14,
"Deinococcus" = 3,
"Enterococcus" = 4,
"Escherichia" = 7,
"Helicobacter" = 2,
"Lactobacillus" = 6,
"Listeria" = 6,
"Neisseria" = 4,
"Propionibacterium" = 3,
"Pseudomonas" = 4,
"Rhodobacter" = 4,
"Staphylococcus aureus" = 6,
"Staphylococcus epidermis" = 6,
"Streptococcus agalactiae" = 7,
"Streptococcus mutans" = 6,
"Streptococcus pneumoniae" = 4)
First calculate the genomic abundances, by summing over all sequence variants for each genus and dividing by the 16S copy number:
gens <- sort(unique(tax[,"Genus"])) # Just the ones that were detected
abund.ome <- sapply(gens, function(gen) {
is.gen <- grepl(gen, tax[,"Genus"])
sum(dd$denoised[is.gen])/ncopy[gen] ###!
### sum(dd$clustering$n0[is.gen])/ncopy[gen] ###! Alternative w/o error-correction
})
names(abund.ome) <- gens
dfgen <- data.frame(Genus=gens, Abundance=abund.ome, stringsAsFactors = FALSE)
p.abundome <- ggplot(data=dfgen, aes(x=Genus, y=Abundance)) +
theme(axis.text.x=element_blank(), axis.ticks.x = element_blank()) +
ylab("Genome Abundance") + xlab("Strains")
p.abundome + geom_col(width=0.4, aes(fill=Genus)) # scale_fill_manual(values=genusPalette) +
p.abundome.nice <- p.abundome+ geom_point(aes(color=Genus), shape="diamond", size=4) +
scale_y_log10(limits=c(0.3, NA), breaks=c(1,10,100,1000))
p.abundome.nice
ggsave(file.path(path.out, "HMP_GenomeAbundances.pdf"), p.abundome.nice,
width=6, height=4, units="in", useDingbats=FALSE)
Now we make a similar plot for the abundance of each sequence variant:
dfasv <- data.frame(Genus=tax[,"Genus"],
Abundance=dd$denoised, ###!
### Abundance=dd$clustering$n0, ###! Alternative w/o error-correction
stringsAsFactors = FALSE)
rownames(dfasv) <- NULL
p.abundasv <- ggplot(data=dfasv, aes(x=Genus, y=Abundance)) +
geom_point(aes(color=Genus), shape="x", size=4) +
ylim(c(0, NA)) +
theme(axis.text.x=element_blank(), axis.ticks.x = element_blank()) +
ylab("ASV Abundance")
p.abundasv
p.abundasv + ylim(0, 500)
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 17 rows containing missing values (geom_point).
summary(dfasv$Abundance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.0 96.5 137.0 1370.8 1785.5 14640.0
summary(dfasv$Abundance)/sum(dfasv$Abundance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0001859 0.0013803 0.0019596 0.0196078 0.0255396 0.2094091
Yep, broad range of abundances over 3 orders of magnitude at the genome and sequence-variant level. Concentrations of DNA from each strain, as assyaed by Qubit, are in the BEI Certificat of Analysis doc:
qubit <- c("Acinetobacter" = 0.010,
"Actinomyces"=0.001,
"Bacillus" = 0.045,
"Bacteroides" = 0.001,
"Clostridium" = 0.044,
"Deinococcus" = 0.001,
"Enterococcus" = 0.001,
"Escherichia" = 0.680,
"Helicobacter" = 0.009,
"Lactobacillus" = 0.003,
"Listeria" = 0.005,
"Neisseria" = 0.006,
"Propionibacterium" = 0.009,
"Pseudomonas" = 0.161,
"Rhodobacter" = 1.413,
"Staphylococcus aureus" = 0.059,
"Staphylococcus epidermis" = 0.513,
"Streptococcus agalactiae" = 0.032,
"Streptococcus mutans" = 0.417,
"Streptococcus pneumoniae" = 0.001)
Note that the dropouts (S. pneumoniae, Deinococcus and Actinomyces) all had the lowest Qubit measurements (0.001), which may be a minimum assigned value rather than a true value anyway (e.g. ND). These were too rare to detect.
plot(qubit[gens], abund.ome[gens], log="xy")
Decent correspondence.
And now we make the “stoichiometry” figure, i.e the abundance of each ASV scaled to the genomic abundance:
dfasv$ScaledAbundance <- dfasv$Abundance/abund.ome[dfasv$Genus]
# Number the ASVs in each strain/genus
dfasv$Variant <- sapply(seq(nrow(dfasv)), function(i) sum(dfasv$Genus[1:i] == dfasv$Genus[[i]], na.rm=TRUE))
p.stoich <- ggplot(data=dfasv, aes(x=Variant, y=ScaledAbundance, fill=Genus, width=0.5)) + geom_col() +
facet_wrap(~Genus, nrow=3) +
scale_y_continuous(breaks=seq(0, round(max(dfasv$ScaledAbundance))), minor_breaks=NULL) +
theme(panel.grid.major.y=element_line(color="black", size=0.25)) +
theme(panel.grid.major.x=element_blank(), panel.grid.minor.x=element_blank()) +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
xlab("Full-length 16S Sequence Variants") +
ylab("Abundance (per-genome)")
p.stoich
ggsave(file.path(path.out, "HMP_Stoich.pdf"), width=12, height=7, units="in")
Now let’s put the genome abundances together with the integers in a simpler plot:
p.stoich.nice <- ggplot(data=dfasv, aes(x=Genus, y=ScaledAbundance, color=Genus, width=0.5)) +
geom_point(shape="x", size=4) +
scale_y_continuous(breaks=seq(0, round(max(dfasv$ScaledAbundance))), minor_breaks=NULL) +
theme(panel.grid.major.y=element_line(color="black", size=0.25)) +
theme(panel.grid.major.x=element_blank(), panel.grid.minor.x=element_blank()) +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
xlab("Full-length 16S Sequence Variants") +
ylab("ASV Abundance (per-genome)")
p.stoich.nice
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:BiocGenerics':
##
## combine
p.nice <- arrangeGrob(p.abundome.nice, p.stoich.nice + guides(color=FALSE), nrow=1, widths=c(0.6, 0.4))
plot(p.nice)
ggsave(file.path(path.out, "HMP_AbundanceAndStoich.pdf"), p.nice,
width=10, height=4, units="in", useDingbats=FALSE)
B. cereus a little off, and Listeria a little off. Others all look very good (although many are just one variant, so trivial integral-ness).
Some is due to low numbers as well, e.g Listeria seems largely down to counting noise:
is.gen <- tax[,"Genus"] == "Listeria"
foo <- data.frame(denoised=dd$denoised[is.gen], n0=dd$clustering$n0[is.gen], exact=drp$uniques[dd$sequence[is.gen]])
rownames(foo) <- NULL
foo
## denoised n0 exact
## 1 80 65 40
## 2 60 50 37
## 3 33 28 19
## 4 41 35 27
Diff between 1/2 increases a lot as reads are corrected. Might just be noise though, not huge numbers here.
However Bacillus is difficult to explain by counting noise. Could the rrn copy number be wrong? All three variants exactly match previously observed sequences so these ASVs being uncorrected sequencing errors seems unlikely.
For completeness, let’s calculate the residual error rate if those low-frequency B. cereus ASVs are errors:
is.bc <- tax[,"Genus"] %in% "Bacillus"
sq.bc <- dd$sequence[is.bc]
# Candidate errors are the 2nd and 3rd B cereus ASVs
tot.err <- sum(nwhamming(sq.bc[[1]], sq.bc[2:3])*dd$denoised[sq.bc[2:3]])
tot.nt <- sum(dd$denoised * nchar(dd$sequence))
tot.err/tot.nt
## [1] 2.611897e-06