The data analyzed here is the S. aureus sample analyzed in the paper “Evaluation of PacBio sequencing for full-length bacterial 16S rRNA gene classification”, Josef Wagner, Paul Coupland, Hilary P. Browne, Trevor D. Lawley, Suzanna C. Francis and Julian Parkhill, BMC Microbiology, https://doi.org/10.1186/s12866-016-0891-4
They performed full-length 16S rRNA gene sequencing using “forward primers (7f 5’ AGAGTTTGATYMTGGCTCAG 3’)” and “reverse primer (1510r 5’ ACGGYTACCTTGTTACGACCT 3’)”. After standard PacBio processing, further bionformatics used the mothur software package.
They report: “We also identified that substitution errors were not evenly distributed across the full-length bacterial 16S rRNA gene but were detected at hotspots across the full-length bacterial 16 s rRNA gene (Fig. 2)… Sequencing error was determined with Mothur using Staphylococcus aureus M1 as a reference sequence.” Blue squares are inferred subsitution error rates under their high stringency processing parameters (green triangles are deletions).
Of note, in the Genome Announcement of S. aureus M1, https://doi.org/10.1128/genomeA.00336-13, they state: “it [was] possible to close all gaps except four of five ribosomal clusters. These ribosomal clusters are almost identical, based on their 5-fold higher coverage than the genome.” That is, only one copy of each rrn gene was in the M1 reference genome.
Also, from correspondence:
I got the S. aureus DNA from my work colleague few years ago. I have used the DNA since then for my PCR and sequencing positive controls. I did some search in my lab books and it is a TW20 strain.
Here we revisit the analysis of the PacBio S. aureus dataset in Wagner et al, with a particular focus on their findings of high rates of substition errors at specific positions in the full-length 16S rRNA gene.
Using a couple versions of the ccs processed file
library(dada2);packageVersion("dada2")
## [1] '1.12.1'
library(ggplot2); packageVersion("ggplot2")
## [1] '3.1.0'
path <- "~/Desktop/LRAS/Data/Saureus" # CHANGE ME to directory containing the fastq file
fn <- file.path(path, "m150206_s1_p0_1_ccs_minpass_2_minprdaccry_99.fastq.gz")
F7 <- "AGAGTTTGATYMTGGCTCAG"
R1510 <- "ACGGYTACCTTGTTACGACCT"
rc <- dada2:::rc
Removing primers and extraction region in between for each file:
nop <- file.path(path, "noprimers", basename(fn))
dada2:::removePrimers(fn, nop, primer.fwd=F7, primer.rev=rc(R1510), orient=TRUE, verbose=TRUE)
## Multiple matches to the primer(s) in some sequences. Using the longest possible match.
## Multiple matches to the primer(s) in some reverse-complement sequences. Using the longest possible match.
## 3476 sequences out of 9541 are being reverse-complemented.
## Overwriting file:/Users/bcallah/Desktop/LRAS/Data/Saureus/noprimers/m150206_s1_p0_1_ccs_minpass_2_minprdaccry_99.fastq.gz
## Read in 9541, output 6327 (66.3%) filtered sequences.
Filter:
filt <- file.path(path, "noprimers", "filtered", basename(fn))
track <- filterAndTrim(nop, filt, minQ=3, rm.phix=FALSE, maxEE=2, verbose=TRUE)
## Overwriting file:/Users/bcallah/Desktop/LRAS/Data/Saureus/noprimers/filtered/m150206_s1_p0_1_ccs_minpass_2_minprdaccry_99.fastq.gz
## Read in 6327, output 4056 (64.1%) filtered sequences.
So the filtering pretty much knocks down the min predicted accuracy thing.
Dereplicate:
drp <- derepFastq(filt, verbose=TRUE)
## Dereplicating sequence entries in Fastq file: ~/Desktop/LRAS/Data/Saureus/noprimers/filtered/m150206_s1_p0_1_ccs_minpass_2_minprdaccry_99.fastq.gz
## Encountered 1457 unique sequences from 4056 total sequences read.
Learn errors:
err <- learnErrors(drp, errorEstimationFunction=PacBioErrfun, BAND_SIZE=32, multithread=TRUE)
## 5964392 total bases in 4056 reads from 1 samples will be used for learning the error rates.
Denoise:
dd <- dada(drp, err=err, BAND_SIZE=32, multithread=TRUE)
## Sample 1 - 4056 reads in 1457 unique sequences.
dd
## dada-class: object describing DADA2 denoising results
## 7 sequence variants were inferred from 1457 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 32
dd$clustering[,-1]
## abundance n0 n1 nunq pval birth_type birth_pval birth_fold
## 1 1521 928 488 584 0.000000e+00 I NA NA
## 2 623 449 146 187 0.000000e+00 A 0.000000e+00 8.555378e+01
## 3 658 432 185 191 0.000000e+00 A 0.000000e+00 1.121128e+02
## 4 525 374 114 199 0.000000e+00 A 0.000000e+00 3.743549e+04
## 5 477 218 191 197 0.000000e+00 A 0.000000e+00 2.743040e+12
## 6 135 97 29 41 3.492111e-63 A 4.285168e-66 1.825145e+01
## 7 117 36 62 58 2.563083e-57 A 7.432817e-54 3.998664e+02
## birth_ham birth_qave
## 1 NA NA
## 2 1 55.00
## 3 1 54.00
## 4 2 55.50
## 5 4 74.75
## 6 1 38.00
## 7 2 33.00
The bottom 2 are marginal. Check for chimeras.
unname(isBimeraDenovo(dd))
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
Chimeras. So, in this dataset we believe there are 5 true sequences.
BLAST the full set of 7 against nt.
##dada2:::pfasta(dd)
rrndb lists S. aureus as having 5 or 6 copies of the 16S gene almost always: https://rrndb.umms.med.umich.edu/
Now consider the TW20 genome from “Genome Sequence of a Recently Emerged, Highly Transmissible, Multi-Antibiotic- and Antiseptic-Resistant Variant of Methicillin-Resistant Staphylococcus aureus, Sequence Type 239 (TW)”.
The sequence and annotation of the TW20 genome have been deposited in the EMBL database under the accession numbers FN433596, FN433597, and FN433598."
Staphylococcus aureus subsp. aureus TW20, taxid:663951, accession:FN433596.1 Plasmid (large): FN433597.1; Plasmid (small): FN433598.1
In a BLAST against TW20 (accession FN433596.1) 1/3/4/5 are exact matches. sq2 has 1 mismatch. No 16S hits to the plasmids. TW20 is listed as having 5 copies of the 16S rRNA gene: https://rrndb.umms.med.umich.edu/genomes/GCF_000027045.1
In summary, these sequences look like they come from TW20, or a very close relative (w/ one rrn gene changed).
In the Genome Announcement of S. aureus M1, https://doi.org/10.1128/genomeA.00336-13:
The [M1] genome data have been deposited in GenBank with accession number HF937103 for the chromosome and HF937104 for the plasmid."
In a BLAST againt M1 (accession HF937103) 1 is an exact match, none of the others are. Also, only one rrn gene is found in the reference.
Let us consider what we think are the true sequences, and plot their mismatches against sq1 (i.e. the M1 reference gene) and see if we can recapitulate Fig 2C.
Collect the mismatches between each sequence and the most abundant sequence.
sq <- dd$sequence[2:5]
ref <- dd$sequence[[1]]
get_subs <- function(query, ref) {
al <- nwalign(query, ref)
subs <- dada2:::strdiff(al[[1]], al[[2]])
colnames(subs) <- c("pos", "sub", "ref")
subs$abund <- dd$denoised[query]
subs
}
subdf <- do.call(rbind, lapply(sq, get_subs, ref=ref))
Plot a version of Figure 2C based on the intra-ASV mismatches.
tot.nt <- sum(nchar(dd$sequence) * dd$clustering$abundance)
ggplot(data=subdf, aes(x=pos, y=abund/tot.nt)) + geom_point(color="blue", shape="square", size=2) +
scale_y_log10(limits=c(0.00001,1), breaks=4^c(-7, -6, -5, -4, -3, -2, -1, 0),
labels=c("0.000061", "0.000244", "0.000977", "0.003906", "0.015625", "0.0625", "0.25", "1")) +
scale_x_continuous(limits=c(0, 1500), breaks=c(0, 300, 600, 900, 1200, 1500)) +
theme(panel.grid=element_blank(), panel.background=element_rect(fill="black")) +
ylab("Sequence error rate") + xlab("Base pair position in 16S gene")
Yep, that’s the same plot (modulo the edge effect sub peak at ~1500, and the deletions). The non-random substitution hotspots at certain positions are just due to the incomplete reference genome with just 1 rrn operon.