Data

The mock community being sequenced is the BEI Resources Microbial Mock Community B (Staggered, Low Concentration) community: https://www.beiresources.org/Catalog/otherProducts/HM-783D.aspx. This mock community (also known as the HMP mock community) contains the following bacterial in the following table, adapted from Table 1 in the BEI document, with alternative strain names and 16S copy numbers added based on the NCBI entry.

Organism NCBI Reference Sequence
Acinetobacter baumannii, strain 5377 NC_009085
Actinomyces odontolyticus, strain 1A.21 NZ_AAYI02000000
Bacillus cereus, strain NRS 248 NC_003909
Bacteroides vulgatus, strain ATCC® 8482 NC_009614
Clostridium beijerinckii, strain NCIMB 8052 NC_009617
Deinococcus radiodurans, strain R1 (smooth) NC_001263, NC_001264
Enterococcus faecalis, strain OG1RF NC_017316
Escherichia coli, strain K12, substrain MG1655 NC_000913
Helicobacter pylori, strain 26695 NC_000915
Lactobacillus gasseri, strain 63 AM NC_008530
Listeria monocytogenes, strain EGDe NC_003210
Neisseria meningitidis, strain MC58 NC_003112
Propionibacterium acnes, strain KPA171202 NC_006085
Pseudomonas aeruginosa, strain PAO1-LAC NC_002516
Rhodobacter sphaeroides, strain ATH 2.4.1 NC_007493, NC_007494
Staphylococcus aureus, strain TCH1516 NC_010079
Staphylococcus epidermidis, FDA strain PCI 1200 NC_004461
Streptococcus agalactiae, strain 2603 V/R NC_004116
Streptococcus mutans, strain UA159 NC_004350
Streptococcus pneumoniae, strain TIGR4 NC_003028

16S copy number is shown for genomes by the NCBI using method “Best-placed reference protein set; GeneMarkS+”

Acknowledgment for publications should read “The following reagent was obtained through BEI Resources, NIAID, NIH as part of the Human Microbiome Project: Genomic DNA from Microbial Mock Community B (Staggered, Low Concentration), v5.2L, for 16S rRNA Gene Sequencing, HM-783D.”

This sequencing data was generated by PacBio using their in-house 16S amplification and sequencing protocol: https://www.pacb.com/wp-content/uploads/Unsupported-Full-Length-16S-Amplification-SMRTbell-LibraryPreparation-and-Sequencing.pdf (everying is the same as the processing for the Zymo mock community)

The sequencing was performed on a Sequel, with the S/P1-C1.2 sequencing chemistry. Standard PacBio bioinformatics were performed, and consensus sequences were extracting using CCS 3.1.1 with default parameters, minPasses=3 and minPredictedAccuracy=0.999:

ccs --pbi --force --logLevel=DEBUG --numThreads=16 --minSnr=3.75 --minReadScore=0.65 --maxLength=7000 --minLength=10 --minPasses=3 --minZScore=-5 --maxDropFraction=0.34 --minPredictedAccuracy=0.999 subreads.bam ccs.bam

Setup

Load libraries, setup paths, prepare environment:

library(dada2);packageVersion("dada2")
## [1] '1.12.1'
library(Biostrings); packageVersion("Biostrings")
## [1] '2.46.0'
library(ggplot2); packageVersion("ggplot2")
## [1] '3.1.0'
library(reshape2); packageVersion("reshape2")
## [1] '1.4.3'
path <- "~/Desktop/LRAS/Data/BEI" # CHANGE ME to location of the fastq file
path.out <- "Figures/"
path.rds <- "RDS/"
fn <- file.path(path, "SO_BEI-stagLo-16S_CCS-99.9.fq.gz")
F27 <- "AGRGTTYGATYMTGGCTCAG"
R1492 <- "RGYTACCTTGTTACGACTT"
rc <- dada2:::rc
theme_set(theme_bw())

Remove Primers and Filter

Remove primers and orient reads:

nop <- file.path(path, "noprimers", basename(fn))
prim <- removePrimers(fn, nop, primer.fwd=F27, primer.rev=dada2:::rc(R1492), orient=TRUE, verbose=TRUE)
## Multiple matches to the primer(s) in some sequences. Using the longest possible match.
## Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match.
## 39947 sequences out of 78328 are being reverse-complemented.
## Overwriting file:/Users/bcallah/Desktop/LRAS/Data/BEI/noprimers/SO_BEI-stagLo-16S_CCS-99.9.fq.gz
## Read in 78328, output 73939 (94.4%) filtered sequences.

Very little loss there even though primer indels aren’t allowed currently.

Inspect length distribution.

hist(nchar(getSequences(nop)), 100)

Very strong peak ~1450. Perhaps some secondary peake nearby, should carry through further sequencing.

Filter:

filt <- file.path(path, "noprimers", "filtered", basename(fn))
track <- filterAndTrim(nop, filt, minQ=3, minLen=1000, maxLen=1600, maxN=0, rm.phix=FALSE, maxEE=2, verbose=TRUE)
## Overwriting file:/Users/bcallah/Desktop/LRAS/Data/BEI/noprimers/filtered/SO_BEI-stagLo-16S_CCS-99.9.fq.gz
## Read in 73939, output 69963 (94.6%) filtered sequences.

Run DADA2

Dereplicate:

drp <- derepFastq(filt, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/BEI/noprimers/filtered/SO_BEI-stagLo-16S_CCS-99.9.fq.gz
## Encountered 19873 unique sequences from 69963 total sequences read.

Learn errors:

err <- learnErrors(drp, BAND_SIZE=32, multithread=TRUE, errorEstimationFunction=dada2:::PacBioErrfun) # seconds
## 102693885 total bases in 69963 reads from 1 samples will be used for learning the error rates.

Looks good.

Inspect errors:

plotErrors(err)
## Warning: Transformation introduced infinite values in continuous y-axis

Denoise:

dd <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE)
## Sample 1 - 69963 reads in 19873 unique sequences.
dd
## dada-class: object describing DADA2 denoising results
## 51 sequence variants were inferred from 19873 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 32

Read tracking:

cbind(ccs=prim[,1], primers=prim[,2], filtered=track[,2], denoised=sum(dd$denoised))
##        ccs primers filtered denoised
## [1,] 78328   73939    69963    69911

Inspect clustering data.frame

sq <- dd$sequence
dd$clustering[,-1]
##    abundance    n0   n1 nunq          pval birth_type    birth_pval
## 1      14640 12410 2048 3406  0.000000e+00          I            NA
## 2       8286  7074 1097 2185  0.000000e+00          A  0.000000e+00
## 3       6713  5623 1009 1778  0.000000e+00          A  0.000000e+00
## 4       4215  3582  577 1178  0.000000e+00          A  0.000000e+00
## 5       3702  3139  518 1014  0.000000e+00          A  0.000000e+00
## 6       3687  3178  458 1051  0.000000e+00          A  0.000000e+00
## 7       3639  3144  460 1046  0.000000e+00          A  0.000000e+00
## 8       3371  2921  403  949  0.000000e+00          A  0.000000e+00
## 9       3276  2774  464  996  0.000000e+00          A  0.000000e+00
## 10      2917  2502  381  848  0.000000e+00          A  0.000000e+00
## 11      1971  1668  269  627  0.000000e+00          A  0.000000e+00
## 12      1810  1520  262  599  0.000000e+00          A  0.000000e+00
## 13      1752  1475  246  581  0.000000e+00          A  0.000000e+00
## 14      1761  1500  231  604  0.000000e+00          A  0.000000e+00
## 15      1515  1297  196  496  0.000000e+00          A  0.000000e+00
## 16      1872  1603  243  641  0.000000e+00          A  0.000000e+00
## 17       545   475   60  181  0.000000e+00          A  0.000000e+00
## 18       291   251   37   97  0.000000e+00          A  0.000000e+00
## 19       289   249   37  102  0.000000e+00          A  0.000000e+00
## 20       301   259   37  117  0.000000e+00          A  0.000000e+00
## 21       255   222   32   94  0.000000e+00          A  0.000000e+00
## 22       272   237   31  111  0.000000e+00          A  0.000000e+00
## 23       267   226   35  104  0.000000e+00          A  0.000000e+00
## 24       162   142   18   62  0.000000e+00          A  0.000000e+00
## 25       123   107   14   42  0.000000e+00          A  0.000000e+00
## 26       137   114   21   61  0.000000e+00          A  0.000000e+00
## 27       114   105    7   40  0.000000e+00          A  0.000000e+00
## 28       115   102   12   44  0.000000e+00          A  0.000000e+00
## 29       112    99   13   41  0.000000e+00          A  0.000000e+00
## 30       111   102    8   42  0.000000e+00          A  0.000000e+00
## 31       105    96    8   40  0.000000e+00          A  0.000000e+00
## 32        98    85   11   34  0.000000e+00          A  0.000000e+00
## 33       112    91   20   49  0.000000e+00          A  0.000000e+00
## 34        95    82   11   32  0.000000e+00          A  0.000000e+00
## 35        98    84   14   40  0.000000e+00          A  0.000000e+00
## 36        93    81    9   36  0.000000e+00          A  0.000000e+00
## 37        88    69   17   31  0.000000e+00          A  0.000000e+00
## 38        99    89   10   40  0.000000e+00          A  0.000000e+00
## 39        88    76   11   33  0.000000e+00          A  0.000000e+00
## 40       106    89   15   50  0.000000e+00          A  0.000000e+00
## 41        80    65   14   41  0.000000e+00          A  0.000000e+00
## 42        60    50    9   24  0.000000e+00          A  0.000000e+00
## 43        62    55    5   30  0.000000e+00          A  0.000000e+00
## 44        13     9    4    9  0.000000e+00          A  0.000000e+00
## 45        19     3    1   18  0.000000e+00          A  0.000000e+00
## 46       139   120   19   49 1.638113e-265          A 1.222673e-254
## 47       129   103   25   54 6.084991e-218          A 5.170449e-210
## 48        88    77   10   30 1.382173e-175          A 7.672142e-154
## 49        33    28    5   15 1.345078e-147          A 2.385693e-131
## 50        44    37    7   15  4.782749e-95          A  3.384763e-85
## 51        41    35    6   15  7.522708e-95          A  1.971157e-83
##      birth_fold birth_ham birth_qave
## 1            NA        NA         NA
## 2           Inf       220   90.22273
## 3  3.700250e+14         3   91.00000
## 4  5.183341e+03         1   84.00000
## 5           Inf       227   90.62996
## 6  9.562878e+13         3   91.00000
## 7  4.172291e+11         3   84.33333
## 8  1.177350e+08         2   89.50000
## 9           Inf       298   90.40604
## 10          Inf       107   89.71028
## 11 5.350447e+44         9   88.77778
## 12 1.536471e+25         5   91.80000
## 13 4.302466e+25         5   92.00000
## 14 1.211854e+50        10   89.16667
## 15 7.737531e+04         1   90.00000
## 16          Inf       193   90.69948
## 17 8.864293e+94        20   91.10000
## 18 2.661190e+07         2   88.00000
## 19          Inf       273   90.69597
## 20 4.976432e+02         1   93.00000
## 21          Inf       179   90.73184
## 22 1.299704e+05         1   92.00000
## 23 1.146236e+03         1   93.00000
## 24          Inf       286   89.81818
## 25          Inf       229   90.85153
## 26          Inf       279   90.06093
## 27 7.463668e+22         5   92.00000
## 28          Inf       279   89.84588
## 29 3.253601e+07         2   92.50000
## 30 9.656787e+08         2   91.50000
## 31 3.923956e+15         4   92.75000
## 32 1.260826e+17         4   92.50000
## 33 1.035197e+12         3   92.66667
## 34 1.244310e+13         3   92.33333
## 35 8.634754e+16         4   90.00000
## 36 9.053939e+17         4   89.75000
## 37 4.389098e+08         2   92.50000
## 38 7.906752e+15         4   89.25000
## 39 3.535124e+18         4   90.75000
## 40 1.490579e+11         3   92.00000
## 41          Inf       191   90.65445
## 42 9.261894e+17         4   92.25000
## 43          Inf       268   89.71269
## 44          Inf       157   90.10828
## 45          Inf       365   91.32055
## 46 2.338938e+03         1   92.00000
## 47 2.099312e+03         1   92.00000
## 48 1.774947e+03         1   93.00000
## 49 6.686106e+07         2   92.00000
## 50 3.368326e+03         1   93.00000
## 51 5.003415e+03         1   93.00000

Big jump from pval 10-126 to below the threshold at OMEGA_A=10^40.

Assign taxonomy:

tax <- assignTaxonomy(dd, "~/tax/silva_nr_v128_train_set.fa.gz", multithread=TRUE)
tax[,"Genus"] <- gsub("Escherichia/Shigella", "Escherichia", tax[,"Genus"]) # Reformat to be compatible with other data sources
tax[,"Genus"] <- gsub("Clostridium_sensu_stricto_1", "Clostridium", tax[,"Genus"]) # Reformat to be compatible with other data sources
unname(tax)
##       [,1]       [,2]             [,3]                   
##  [1,] "Bacteria" "Firmicutes"     "Bacilli"              
##  [2,] "Bacteria" "Firmicutes"     "Bacilli"              
##  [3,] "Bacteria" "Firmicutes"     "Bacilli"              
##  [4,] "Bacteria" "Firmicutes"     "Bacilli"              
##  [5,] "Bacteria" "Proteobacteria" "Alphaproteobacteria"  
##  [6,] "Bacteria" "Firmicutes"     "Bacilli"              
##  [7,] "Bacteria" "Firmicutes"     "Bacilli"              
##  [8,] "Bacteria" "Firmicutes"     "Bacilli"              
##  [9,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"  
## [10,] "Bacteria" "Firmicutes"     "Bacilli"              
## [11,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"  
## [12,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"  
## [13,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"  
## [14,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"  
## [15,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"  
## [16,] "Bacteria" "Firmicutes"     "Bacilli"              
## [17,] "Bacteria" "Firmicutes"     "Bacilli"              
## [18,] "Bacteria" "Firmicutes"     "Bacilli"              
## [19,] "Bacteria" "Proteobacteria" "Epsilonproteobacteria"
## [20,] "Bacteria" "Firmicutes"     "Bacilli"              
## [21,] "Bacteria" "Firmicutes"     "Bacilli"              
## [22,] "Bacteria" "Firmicutes"     "Bacilli"              
## [23,] "Bacteria" "Firmicutes"     "Bacilli"              
## [24,] "Bacteria" "Proteobacteria" "Betaproteobacteria"   
## [25,] "Bacteria" "Firmicutes"     "Clostridia"           
## [26,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"  
## [27,] "Bacteria" "Firmicutes"     "Clostridia"           
## [28,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"  
## [29,] "Bacteria" "Firmicutes"     "Clostridia"           
## [30,] "Bacteria" "Firmicutes"     "Clostridia"           
## [31,] "Bacteria" "Firmicutes"     "Clostridia"           
## [32,] "Bacteria" "Firmicutes"     "Clostridia"           
## [33,] "Bacteria" "Firmicutes"     "Clostridia"           
## [34,] "Bacteria" "Firmicutes"     "Clostridia"           
## [35,] "Bacteria" "Firmicutes"     "Clostridia"           
## [36,] "Bacteria" "Firmicutes"     "Clostridia"           
## [37,] "Bacteria" "Firmicutes"     "Clostridia"           
## [38,] "Bacteria" "Firmicutes"     "Clostridia"           
## [39,] "Bacteria" "Firmicutes"     "Clostridia"           
## [40,] "Bacteria" "Firmicutes"     "Clostridia"           
## [41,] "Bacteria" "Firmicutes"     "Bacilli"              
## [42,] "Bacteria" "Firmicutes"     "Bacilli"              
## [43,] "Bacteria" "Actinobacteria" "Actinobacteria"       
## [44,] "Bacteria" "Firmicutes"     "Bacilli"              
## [45,] "Bacteria" "Bacteroidetes"  "Bacteroidia"          
## [46,] "Bacteria" "Firmicutes"     "Bacilli"              
## [47,] "Bacteria" "Firmicutes"     "Bacilli"              
## [48,] "Bacteria" "Firmicutes"     "Clostridia"           
## [49,] "Bacteria" "Firmicutes"     "Bacilli"              
## [50,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"  
## [51,] "Bacteria" "Firmicutes"     "Bacilli"              
##       [,4]                  [,5]                   [,6]               
##  [1,] "Lactobacillales"     "Streptococcaceae"     "Streptococcus"    
##  [2,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
##  [3,] "Lactobacillales"     "Streptococcaceae"     "Streptococcus"    
##  [4,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
##  [5,] "Rhodobacterales"     "Rhodobacteraceae"     "Rhodobacter"      
##  [6,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
##  [7,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
##  [8,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
##  [9,] "Enterobacteriales"   "Enterobacteriaceae"   "Escherichia"      
## [10,] "Lactobacillales"     "Streptococcaceae"     "Streptococcus"    
## [11,] "Enterobacteriales"   "Enterobacteriaceae"   "Escherichia"      
## [12,] "Enterobacteriales"   "Enterobacteriaceae"   "Escherichia"      
## [13,] "Enterobacteriales"   "Enterobacteriaceae"   "Escherichia"      
## [14,] "Enterobacteriales"   "Enterobacteriaceae"   "Escherichia"      
## [15,] "Enterobacteriales"   "Enterobacteriaceae"   "Escherichia"      
## [16,] "Bacillales"          "Bacillaceae"          "Bacillus"         
## [17,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
## [18,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
## [19,] "Campylobacterales"   "Helicobacteraceae"    "Helicobacter"     
## [20,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
## [21,] "Lactobacillales"     "Lactobacillaceae"     "Lactobacillus"    
## [22,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
## [23,] "Bacillales"          "Staphylococcaceae"    "Staphylococcus"   
## [24,] "Neisseriales"        "Neisseriaceae"        "Neisseria"        
## [25,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [26,] "Pseudomonadales"     "Pseudomonadaceae"     "Pseudomonas"      
## [27,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [28,] "Pseudomonadales"     "Moraxellaceae"        "Acinetobacter"    
## [29,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [30,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [31,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [32,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [33,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [34,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [35,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [36,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [37,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [38,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [39,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [40,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [41,] "Bacillales"          "Listeriaceae"         "Listeria"         
## [42,] "Bacillales"          "Listeriaceae"         "Listeria"         
## [43,] "Propionibacteriales" "Propionibacteriaceae" "Propionibacterium"
## [44,] "Lactobacillales"     "Enterococcaceae"      "Enterococcus"     
## [45,] "Bacteroidales"       "Bacteroidaceae"       "Bacteroides"      
## [46,] "Bacillales"          "Bacillaceae"          "Bacillus"         
## [47,] "Bacillales"          "Bacillaceae"          "Bacillus"         
## [48,] "Clostridiales"       "Clostridiaceae_1"     "Clostridium"      
## [49,] "Bacillales"          "Listeriaceae"         "Listeria"         
## [50,] "Pseudomonadales"     "Pseudomonadaceae"     "Pseudomonas"      
## [51,] "Bacillales"          "Listeriaceae"         "Listeria"

Plot abundances:

plot(dd$denoised)

plot(dd$denoised, log="y")

Broad range of abundances over 3 orders of magnitude (10-10000).

Check chiemras:

bim <- isBimeraDenovo(dd, minFoldParentOverAbundance=3.5, multithread=TRUE) 
# Higher MFPOA to avoid flagging intra-genomic variants
table(bim)
## bim
## FALSE 
##    51

No chimeras!

table(tax[,"Genus"])
## 
##     Acinetobacter          Bacillus       Bacteroides       Clostridium 
##                 1                 3                 1                15 
##      Enterococcus       Escherichia      Helicobacter     Lactobacillus 
##                 1                 6                 1                 1 
##          Listeria         Neisseria Propionibacterium       Pseudomonas 
##                 4                 1                 1                 2 
##       Rhodobacter    Staphylococcus     Streptococcus 
##                 1                10                 3

Save processed objects for future analysis.

saveRDS(dd, file.path(path.rds, "HMP_dd.rds"))
saveRDS(tax, file.path(path.rds, "HMP_tax_Silva128.rds"))

Reload processed data objects (can run code below from here in absence of input sequence data):

dd <- readRDS(file.path(path.rds, "HMP_dd.rds"))
tax <- readRDS(file.path(path.rds, "HMP_tax_Silva128.rds"))

Reference check againt nt

BLAST these 51 sequences against nt (https://blast.ncbi.nlm.nih.gov/Blast.cgi):

## dada2:::pfasta(dd, id=tax[,6])
uniquesToFasta(dd, file.path(path.rds, "LRAS_HMP_ASVs.fa"), ids=paste0(tax[,"Genus"], ";size=", dd$denoised, ";"))
is.ec <- tax[,"Genus"] %in% "Escherichia"
uniquesToFasta(dd$clustering[is.ec,], file.path(path.rds, "HMP_Ecoli.fa"), ids=paste0("Ecoli", seq(sum(is.ec)), "_HMP"))
## BLAST against nt excluding uncultured/environmental sequences (checkbox)

Results of BLAST search on July 19, 2018 are recorded.

  1. Exact match to Strep. mutans strains: NG8, UA159-FR, UA159, NN2025 DNA, ChDC YM216, ChDC YM97, ChDC YM64, ChDC YM220
  2. Exact match to Staph. epidermis ATCC 12228
  3. Exact match to Strep. mutans strains: UA159, ChDC YM89
  4. Exact match to many (~40) Staph. epidermis strains.
  5. Exact match to many (~20) Rhodobacter sphaeroides strains.
  6. Exact match to Staph. epidermis strains: FDAARGOS, RP62A
  7. Two mismatches (consecutive) to Staph. epidermidis strains: FDAARGOS_161, 1457, PM221 Query: 1235/6 AA, Refs: GG
  8. Exact match to Staph. epidermis strains: RP62A, ATCC 12228
  9. Exact match to many (>100) E. coli strains.
  10. Exact match to many (~50) Strep. agalactiae strains.
  11. Exact match to many (>100) E. coli strains.
  12. Exact match to many (~90) E. coli strains.
  13. Exact match to many (>100) E. coli strains.
  14. Exact match to many (~80) E. coli strains.
  15. Exact match to many (~50) E. coli strains.
  16. Exact match to Bacillus cereus strains: M3, LWYT0010, LWYT0301, FRI-35, PPB, ATCC 10987
  17. Exact match to many (>100) Staph. aureus strains.
  18. Exact match to many (99) Staph. aureus strains.
  19. Exact match to Helicobacter pylori strains: dRdM2addM2, 26695-dR, 26695-dRdM1dM2, 26695-dRdM2, dRdM1
  20. Exact match to many (~15) Staph. aureus strains.
  21. Exact match to Lactobacillus gasseri strains: ATCC 33323, Floraactive 18911-2, ATCC 33323, ATCC 33323, ATCC 33323
  22. Exact match to many (~70) Staph. aureus strains.
  23. Exact match to many (~60) Staph. aureus strains.
  24. Exact match to Neisseria meningitidis strains: FDAARGOS_210, LNP21362, H44/76, MC58
  25. Exact match to Clostridium beijerinckii strains: DSM 6423, ATCC 35702, N7, NCIMB 8052
  26. Exact match to many (>100) Pseudomonas aeruginosa strains.
  27. Exact match to Clostridium beijerinckii strains: BAS/B3/I/124, MF28, NCIMB 14988, ATCC 35702, NCIMB 8052
  28. Exact match to many (~95) Acinetobacter baumannii strains.
  29. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
  30. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
  31. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
  32. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
  33. Exact match to Clostridium sp. strain: MF28 (unnamed, but 1-off from C. beijerinckii, 2-ff from 35702/8052)
  34. One mismatch to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052, JCM 8023: Query 955T, Refs C
  35. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
  36. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
  37. One mismatch to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052, NCIMB 14988: Query 231C, Ref T OR Query 1209T, Refs C
  38. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
  39. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052, NRRL B-598
  40. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052
  41. Exact match to many (>100) Listeria monocytogenes strains.
  42. Exact match to many (~30) Listeria monocytogenes strains.
  43. Exact match to many (~15) Propionibacterium/Cutibacterium acnes strains. (P.acnes -> C.acnes name change)
  44. Exact match to many (~50) Enterococcus faecalis strains.
  45. Exact match to Bacteroides vulgatus ATCC 8482
  46. Exact match to Bacillus cereus ATCC 10987
  47. Exact match to Bacillus cereus ATCC 10987 (one mismatch in web BLAST due to not handling ambiguous nt code S in reference)
  48. Exact match to Clostridium beijerinckii strains: ATCC 35702, NCIMB 8052, JCM 8023
  49. Exact match to many (~70) Listeria monocytogenes strains.
  50. Exact match to Pseudomonas aeruginosa strains: ATCC 15692, PAO1OR, PAO1H2O, PAO1-VE13, PAO1-VE2, PAO581, c7447m
  51. Exact match to many (~30) Listeria monocytogenes strains.

Store these results in R format:

refhit.nt <- ifelse(seq_along(sq) %in% c(7, 34, 37), FALSE, TRUE)
which(!refhit.nt)
## [1]  7 34 37

48/51 of these are exact matches to sequenced and cultured isolates, over their entire ~1.5Kb length!

Actinomyces odontolyticus, Deinococcus radiodurans and Streptococcus pneumoniae were expected members of this mock community that were not detected in the data.

Let’s deconvolute a couple strains by assigning species based on BLAST results to genera with multiple species (Staphylococcus and Streptococcus):

tax[c(1,3),"Genus"] <- "Streptococcus mutans"
tax[10,"Genus"] <- "Streptococcus agalactiae"
tax[c(2,4,6,7,8),"Genus"] <- "Staphylococcus epidermis"
tax[c(17,18,20,22,23),"Genus"] <- "Staphylococcus aureus"

Reference check against 16S references

BEI does not provide reference sequences for these strains. However, Robert Edgar has generated a set of references sequences as best he can based on publicy deposited genomes, and we will use that here:

fn16S <- file.path("Docs/BEI_Reference_Edgar.fa")
refs <- getSequences(fn16S)
unique(sapply(strsplit(names(refs), "\\s"), function(x) paste(x[2], x[3])))
##  [1] "Streptococcus agalactiae"   "Lactobacillus gasseri"     
##  [3] "Bacillus cereus"            "Acinetobacter baumannii"   
##  [5] "Porphyromonas gingivalis"   "Actinomyces odontolyticus" 
##  [7] "Streptococcus pneumoniae"   "Neisseria meningitidis"    
##  [9] "Streptococcus mutans"       "Enterococcus faecalis"     
## [11] "Rhodobacter sphaeroides"    "Propionibacterium acnes"   
## [13] "Pseudomonas aeruginosa"     "Deinococcus radiodurans"   
## [15] "Bacteroides vulgatus"       "Escherichia coli"          
## [17] "Listeria monocytogenes"     "Staphylococcus aureus"     
## [19] "Clostridium beijerinckii"   "Methanobrevibacter smithii"
## [21] "Staphylococcus epidermidis" "Helicobacter pylori"

Twenty-two species, whereas only twenty expected in this HMP mock. The extra taxa are Methanobrevibacter smithii and Porphyromonas gingivalis, otherwise everything matches expectations.

Name the reference sequences something useful and count the number of unique (I think) variants in each:

ref.16S <- getSequences(fn16S)
genus.spc <- sapply(strsplit(names(ref.16S), "\\s"), function(x) paste(x[2], x[3], sep="_"))
variant <- sapply(seq_along(genus.spc), function(i) sum(genus.spc[1:i] == genus.spc[[i]]))
names(ref.16S) <- paste(genus.spc, variant, sep="_")
nunq.ref <- table(genus.spc)
nunq.ref
## genus.spc
##    Acinetobacter_baumannii  Actinomyces_odontolyticus 
##                          1                          1 
##            Bacillus_cereus       Bacteroides_vulgatus 
##                          7                          6 
##   Clostridium_beijerinckii    Deinococcus_radiodurans 
##                         14                          2 
##      Enterococcus_faecalis           Escherichia_coli 
##                          2                          6 
##        Helicobacter_pylori      Lactobacillus_gasseri 
##                          2                          1 
##     Listeria_monocytogenes Methanobrevibacter_smithii 
##                          4                          2 
##     Neisseria_meningitidis   Porphyromonas_gingivalis 
##                          1                          1 
##    Propionibacterium_acnes     Pseudomonas_aeruginosa 
##                          2                          2 
##    Rhodobacter_sphaeroides      Staphylococcus_aureus 
##                          2                          4 
## Staphylococcus_epidermidis   Streptococcus_agalactiae 
##                          5                          1 
##       Streptococcus_mutans   Streptococcus_pneumoniae 
##                          2                          1
table(tax[,"Genus"])
## 
##            Acinetobacter                 Bacillus              Bacteroides 
##                        1                        3                        1 
##              Clostridium             Enterococcus              Escherichia 
##                       15                        1                        6 
##             Helicobacter            Lactobacillus                 Listeria 
##                        1                        1                        4 
##                Neisseria        Propionibacterium              Pseudomonas 
##                        1                        1                        2 
##              Rhodobacter    Staphylococcus aureus Staphylococcus epidermis 
##                        1                        5                        5 
## Streptococcus agalactiae     Streptococcus mutans 
##                        1                        2

Now compare to DADA2 inferred sequences to the reference sequences, allowing no mismatches (but with fixed=FALSE so as to match ambiguous nucleotides in the references):

sq <- dd$sequence
refmat.16S <- sapply(DNAStringSet(sq), vcountPattern, subject=DNAStringSet(ref.16S), fixed=FALSE)
rownames(refmat.16S) <- names(ref.16S)
colnames(refmat.16S) <- paste0("Seq", seq_along(sq))

Inspect:

refhit.16S <- colSums(refmat.16S)>0
which(!refhit.16S)
##  Seq4  Seq6  Seq7 Seq19 Seq20 Seq33 Seq34 Seq37 
##     4     6     7    19    20    33    34    37
table(tax[,"Genus"], Reference.Match=refhit.16S)
##                           Reference.Match
##                            FALSE TRUE
##   Acinetobacter                0    1
##   Bacillus                     0    3
##   Bacteroides                  0    1
##   Clostridium                  3   12
##   Enterococcus                 0    1
##   Escherichia                  0    6
##   Helicobacter                 1    0
##   Lactobacillus                0    1
##   Listeria                     0    4
##   Neisseria                    0    1
##   Propionibacterium            0    1
##   Pseudomonas                  0    2
##   Rhodobacter                  0    1
##   Staphylococcus aureus        1    4
##   Staphylococcus epidermis     3    2
##   Streptococcus agalactiae     0    1
##   Streptococcus mutans         0    2

So 43/51 exactly match against this 16S reference based on the described strains. The 8 that don’t hit include 3/5 S. epidermis sequence variants, 1/5 S. aureus variants, 1/1 H. pylori variant, and 3/15 C. bejerenckii variants. They overlap with 1/5 S. epidermis variants, and 2/15 C. bejerenckii variants that also were not matched against nt.

Intra-genomic Stoichiometry

Now we’ll investigate the “stoichiometry” of the sequence variants from each bacterial strain, i.e. the copy number of each 16S sequence variants relative to the genomic copy number, all inferred from the dada2 processed data and the rRNA gene copy numbers given for each strain in the Zymo reference.

These 16S copy numbers are based on (1) the NCBI reference sequence annotation if present, (2) an estimate from https://rrndb.umms.med.umich.edu/search/, and (3) inspection of the number of 16S rRNA genes in the best BLAST hits of each strain’s ASVs. If copy number was ambiguous, the best BLAST hits were chosen to break the tie, as likely to most closely capture the actual strain being interrogated here.

ncopy <- c("Acinetobacter" = 6, 
           "Actinomyces"=2, 
           "Bacillus" = 12, 
           "Bacteroides" = 7, 
           "Clostridium" = 14, 
           "Deinococcus" = 3, 
           "Enterococcus" = 4, 
           "Escherichia" = 7,
           "Helicobacter" = 2, 
           "Lactobacillus" = 6, 
           "Listeria" = 6, 
           "Neisseria" = 4, 
           "Propionibacterium" = 3, 
           "Pseudomonas" = 4, 
           "Rhodobacter" = 4, 
           "Staphylococcus aureus" = 6,
           "Staphylococcus epidermis" = 6, 
           "Streptococcus agalactiae" = 7, 
           "Streptococcus mutans" = 6,
           "Streptococcus pneumoniae" = 4)

First calculate the genomic abundances, by summing over all sequence variants for each genus and dividing by the 16S copy number:

gens <- sort(unique(tax[,"Genus"])) # Just the ones that were detected
abund.ome <- sapply(gens, function(gen) {
  is.gen <- grepl(gen, tax[,"Genus"])
  sum(dd$denoised[is.gen])/ncopy[gen] ###!
###  sum(dd$clustering$n0[is.gen])/ncopy[gen] ###! Alternative w/o error-correction
})
names(abund.ome) <- gens
dfgen <- data.frame(Genus=gens, Abundance=abund.ome, stringsAsFactors = FALSE)
p.abundome <- ggplot(data=dfgen, aes(x=Genus, y=Abundance)) + 
  theme(axis.text.x=element_blank(), axis.ticks.x = element_blank()) +
  ylab("Genome Abundance") + xlab("Strains")
p.abundome + geom_col(width=0.4, aes(fill=Genus)) # scale_fill_manual(values=genusPalette) +

p.abundome.nice <- p.abundome+ geom_point(aes(color=Genus), shape="diamond", size=4) +
  scale_y_log10(limits=c(0.3, NA), breaks=c(1,10,100,1000)) 
p.abundome.nice

ggsave(file.path(path.out, "HMP_GenomeAbundances.pdf"), p.abundome.nice,
       width=6, height=4, units="in", useDingbats=FALSE)

Now we make a similar plot for the abundance of each sequence variant:

dfasv <- data.frame(Genus=tax[,"Genus"], 
                    Abundance=dd$denoised, ###!
                    ### Abundance=dd$clustering$n0, ###! Alternative w/o error-correction
                    stringsAsFactors = FALSE)
rownames(dfasv) <- NULL
p.abundasv <- ggplot(data=dfasv, aes(x=Genus, y=Abundance)) + 
  geom_point(aes(color=Genus), shape="x", size=4) + 
  ylim(c(0, NA)) +
  theme(axis.text.x=element_blank(), axis.ticks.x = element_blank()) +
  ylab("ASV Abundance")
p.abundasv

p.abundasv + ylim(0, 500)
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 17 rows containing missing values (geom_point).

summary(dfasv$Abundance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    13.0    96.5   137.0  1370.8  1785.5 14640.0
summary(dfasv$Abundance)/sum(dfasv$Abundance)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0001859 0.0013803 0.0019596 0.0196078 0.0255396 0.2094091

Yep, broad range of abundances over 3 orders of magnitude at the genome and sequence-variant level. Concentrations of DNA from each strain, as assyaed by Qubit, are in the BEI Certificat of Analysis doc:

qubit <- c("Acinetobacter" = 0.010, 
           "Actinomyces"=0.001, 
           "Bacillus" = 0.045, 
           "Bacteroides" = 0.001, 
           "Clostridium" = 0.044, 
           "Deinococcus" = 0.001, 
           "Enterococcus" = 0.001, 
           "Escherichia" = 0.680,
           "Helicobacter" = 0.009, 
           "Lactobacillus" = 0.003, 
           "Listeria" = 0.005, 
           "Neisseria" = 0.006, 
           "Propionibacterium" = 0.009, 
           "Pseudomonas" = 0.161, 
           "Rhodobacter" = 1.413, 
           "Staphylococcus aureus" = 0.059, 
           "Staphylococcus epidermis" = 0.513, 
           "Streptococcus agalactiae" = 0.032, 
           "Streptococcus mutans" = 0.417, 
           "Streptococcus pneumoniae" = 0.001)

Note that the dropouts (S. pneumoniae, Deinococcus and Actinomyces) all had the lowest Qubit measurements (0.001), which may be a minimum assigned value rather than a true value anyway (e.g. ND). These were too rare to detect.

plot(qubit[gens], abund.ome[gens], log="xy")

Decent correspondence.

And now we make the “stoichiometry” figure, i.e the abundance of each ASV scaled to the genomic abundance:

dfasv$ScaledAbundance <- dfasv$Abundance/abund.ome[dfasv$Genus]
# Number the ASVs in each strain/genus
dfasv$Variant <- sapply(seq(nrow(dfasv)), function(i) sum(dfasv$Genus[1:i] == dfasv$Genus[[i]], na.rm=TRUE))

p.stoich <- ggplot(data=dfasv, aes(x=Variant, y=ScaledAbundance, fill=Genus, width=0.5)) + geom_col() + 
  facet_wrap(~Genus, nrow=3) +
  scale_y_continuous(breaks=seq(0, round(max(dfasv$ScaledAbundance))), minor_breaks=NULL) +
  theme(panel.grid.major.y=element_line(color="black", size=0.25)) +
  theme(panel.grid.major.x=element_blank(), panel.grid.minor.x=element_blank()) +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  xlab("Full-length 16S Sequence Variants") + 
  ylab("Abundance (per-genome)")
p.stoich

ggsave(file.path(path.out, "HMP_Stoich.pdf"), width=12, height=7, units="in")

Now let’s put the genome abundances together with the integers in a simpler plot:

p.stoich.nice <- ggplot(data=dfasv, aes(x=Genus, y=ScaledAbundance, color=Genus, width=0.5)) + 
  geom_point(shape="x", size=4) + 
  scale_y_continuous(breaks=seq(0, round(max(dfasv$ScaledAbundance))), minor_breaks=NULL) +
  theme(panel.grid.major.y=element_line(color="black", size=0.25)) +
  theme(panel.grid.major.x=element_blank(), panel.grid.minor.x=element_blank()) +
  theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) +
  xlab("Full-length 16S Sequence Variants") + 
  ylab("ASV Abundance (per-genome)")
p.stoich.nice

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
p.nice <- arrangeGrob(p.abundome.nice, p.stoich.nice + guides(color=FALSE), nrow=1, widths=c(0.6, 0.4))
plot(p.nice)

ggsave(file.path(path.out, "HMP_AbundanceAndStoich.pdf"), p.nice,
                 width=10, height=4, units="in", useDingbats=FALSE)

B. cereus a little off, and Listeria a little off. Others all look very good (although many are just one variant, so trivial integral-ness).

Some is due to low numbers as well, e.g Listeria seems largely down to counting noise:

is.gen <- tax[,"Genus"] == "Listeria"
foo <- data.frame(denoised=dd$denoised[is.gen], n0=dd$clustering$n0[is.gen], exact=drp$uniques[dd$sequence[is.gen]])
rownames(foo) <- NULL
foo
##   denoised n0 exact
## 1       80 65    40
## 2       60 50    37
## 3       33 28    19
## 4       41 35    27

Diff between 1/2 increases a lot as reads are corrected. Might just be noise though, not huge numbers here.

However Bacillus is difficult to explain by counting noise. Could the rrn copy number be wrong? All three variants exactly match previously observed sequences so these ASVs being uncorrected sequencing errors seems unlikely.

For completeness, let’s calculate the residual error rate if those low-frequency B. cereus ASVs are errors:

is.bc <- tax[,"Genus"] %in% "Bacillus"
sq.bc <- dd$sequence[is.bc]
# Candidate errors are the 2nd and 3rd B cereus ASVs
tot.err <- sum(nwhamming(sq.bc[[1]], sq.bc[2:3])*dd$denoised[sq.bc[2:3]])
tot.nt <- sum(dd$denoised * nchar(dd$sequence))
tot.err/tot.nt
## [1] 2.611897e-06